Exploring Large Chemical Data Sets: Interactive Analysis and Visualization. Kyle Lutz and Marcus D. Hanwell. August 21, 2012, Skolnik Symposium

Upload: kylelutz

Post on 19-Jun-2015


Page 1: Exploring Large Chemical Data Sets

Exploring Large Chemical Data Sets

Interactive Analysis and Visualization

Kyle Lutz and Marcus D. Hanwell

August 21, 2012, Skolnik Symposium

Page 2: Exploring Large Chemical Data Sets

Overview

● An open-source, cross-platform cheminformatics tool

● A general-purpose tool for chemical data exploration and analysis

● Interactive, editable, and queryable database of chemical data on the desktop

● Part of the Open Chemistry application suite (Avogadro and MoleQueue)

● Leverages several open-source projects: Qt, VTK, Chemkit, Open Babel, MongoDB

Page 3: Exploring Large Chemical Data Sets

Architecture

● Native, cross-platform C++ application built with Qt

● Stores chemical data in a NoSQL MongoDB database

● Uses VTK for 2D and 3D data set visualization

Page 4: Exploring Large Chemical Data Sets

Main Window

Page 5: Exploring Large Chemical Data Sets

Molecule Details

Page 6: Exploring Large Chemical Data Sets

Queries

Supports different queries:

● Name

● Formula

● InChI

● InChIKey

● Structure and Substructure
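The identifier queries listed above map naturally onto MongoDB filter documents. The following is a minimal sketch under stated assumptions, not the application's actual query code: only the field names "name" and "heavyAtomCount" appear in these slides; "formula", "inchi", and "inchikey" are hypothetical field names, and the heavy-atom check is presumably only a cheap prefilter that runs before a real substructure match.

```python
# Field names other than "name" and "heavyAtomCount" are assumptions;
# the slides do not show the full document schema.
FIELD_FOR_QUERY = {
    "name": "name",
    "formula": "formula",
    "inchi": "inchi",
    "inchikey": "inchikey",
}


def exact_filter(kind, value):
    """Return a MongoDB-style filter for an exact identifier match."""
    return {FIELD_FOR_QUERY[kind]: value}


def substructure_prefilter(query_heavy_atoms):
    """Keep only molecules with at least as many heavy atoms as the query."""
    return {"heavyAtomCount": {"$gte": query_heavy_atoms}}


def matches(document, query_filter):
    """Tiny in-memory evaluator supporting equality and $gte only."""
    for field, condition in query_filter.items():
        value = document.get(field)
        if isinstance(condition, dict):
            if value is None or value < condition["$gte"]:
                return False
        elif value != condition:
            return False
    return True


# Illustrative records, not data from the talk.
molecules = [
    {"name": "ethane", "formula": "C2H6", "heavyAtomCount": 2},
    {"name": "ethanol", "formula": "C2H6O", "heavyAtomCount": 3},
]

by_formula = [m["name"] for m in molecules
              if matches(m, exact_filter("formula", "C2H6O"))]
large_enough = [m["name"] for m in molecules
                if matches(m, substructure_prefilter(3))]
print(by_formula, large_enough)
```

The same filter dictionaries could be passed to a MongoDB driver's find call; the in-memory `matches` helper exists only so the sketch runs without a database.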

Page 7: Exploring Large Chemical Data Sets

Similarity Searching

Page 8: Exploring Large Chemical Data Sets

Charts and Plots

Histogram of logP

Scatter Plot of Polar Surface Area (TPSA) against Volume (VABC)

Page 9: Exploring Large Chemical Data Sets

Multidimensional Analysis

● Provide tools for viewing and analyzing large amounts of data with multiple dimensions
  ○ Scatter Plot Matrix
  ○ Parallel Coordinates
  ○ K-Means Clustering

● Interactive charts supporting selection

● Easy to add new chemical descriptors

Page 10: Exploring Large Chemical Data Sets

Scatter Plot Matrix

Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume

Page 11: Exploring Large Chemical Data Sets

Parallel Coordinates

Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
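Descriptors such as polar surface area, logP, and mass live on very different numeric ranges, so a parallel coordinates plot typically scales each axis independently before drawing. The sketch below shows one common choice, min-max scaling to [0, 1]. It is an assumption for illustration; the slides do not say how the application scales its axes, and the descriptor values are made up.

```python
def min_max_scale(rows):
    """Scale every column of `rows` to [0, 1] independently.

    A constant column has no spread, so its values are placed at 0.5.
    """
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.5
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]


# Columns: polar surface area, logP, mass (illustrative values only).
descriptors = [
    (20.0, 1.0, 46.0),
    (40.0, 3.0, 78.0),
    (60.0, 2.0, 180.0),
]
scaled = min_max_scale(descriptors)
for row in scaled:
    print(row)
```

Each scaled row can then be drawn as one polyline crossing the vertical axes at the scaled heights.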

Page 12: Exploring Large Chemical Data Sets

K-Means Clustering

● ~30 numeric molecular descriptors

● 1D, 2D, and 3D visualization

● Selection and extraction of molecules from clusters
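For readers unfamiliar with k-means, the following is a minimal sketch of Lloyd's algorithm on descriptor vectors. It is not the application's implementation (which the slides do not show); the two-descriptor points, the fixed iteration count, and the seeded random initialization are all assumptions chosen to keep the example small.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster `points` (tuples of floats) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignments, centroids


# (logP, polar surface area) pairs; illustrative values only.
points = [
    (1.0, 20.0), (1.2, 22.0), (0.9, 19.0),
    (4.0, 90.0), (4.2, 95.0), (3.9, 88.0),
]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

Selecting a cluster then amounts to collecting the molecules whose assignment equals the chosen cluster index. With around 30 descriptors on different scales, it is usually worth normalizing the columns first so that no single descriptor dominates the distance.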

Page 13: Exploring Large Chemical Data Sets

Similarity Visualization

● Similarity Clustering

● Calculated from fingerprint similarity or structural similarity
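Fingerprint similarity is commonly measured with the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number set in either. The sketch below assumes that measure and a simple threshold rule for linking molecules; the slides do not state which measure or clustering rule the application uses, and the fingerprints here are short made-up bit patterns.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    shared = bin(fp_a & fp_b).count("1")
    total = bin(fp_a | fp_b).count("1")
    # Two empty fingerprints share nothing; returning 0.0 is a convention.
    return shared / total if total else 0.0


def similar_pairs(fingerprints, threshold):
    """Return name pairs whose similarity is at least `threshold`."""
    return [
        (name_a, name_b)
        for (name_a, fp_a), (name_b, fp_b) in combinations(fingerprints.items(), 2)
        if tanimoto(fp_a, fp_b) >= threshold
    ]


# Hypothetical 6-bit fingerprints; real ones are much longer.
fingerprints = {"A": 0b101101, "B": 0b100111, "C": 0b010000}
print(tanimoto(fingerprints["A"], fingerprints["B"]))
print(similar_pairs(fingerprints, 0.45))
```

Raising the threshold removes links between less similar molecules, which breaks the similarity graph into smaller, tighter groups.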

Page 14: Exploring Large Chemical Data Sets

Similarity Visualization

Figure panels: 30%, 45%, 60%

Page 15: Exploring Large Chemical Data Sets

ChemicalJSON

Specification available at: http://wiki.openchemistry.org/Chemical_JSON

Example: ethane.cjson

● JSON (JavaScript Object Notation) is a "lightweight data-interchange format"

● Store molecular structure, geometry, identifiers and descriptors all as a single JSON object

● Benefits:
  ○ More compact than XML/CML
  ○ Native language of MongoDB and JSON-RPC
  ○ Easily converted to a binary representation (BSON)
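To make the "single JSON object" idea concrete, here is a small sketch that builds an ethane record and round-trips it through JSON text. The key names ("atoms", "atomicNumbers", "bonds", "connections", "orders", "descriptors") are hypothetical and chosen for readability; consult the specification linked above for the actual ChemicalJSON layout. Geometry is omitted to avoid inventing coordinates.

```python
import json

# Hypothetical layout for illustration; not copied from the specification.
ethane = {
    "name": "ethane",
    "formula": "C2H6",
    "inchi": "InChI=1S/C2H6/c1-2/h1-2H3",
    "atoms": {
        # Two carbons followed by six hydrogens.
        "atomicNumbers": [6, 6, 1, 1, 1, 1, 1, 1],
    },
    "bonds": {
        # Pairs of zero-based atom indices: one C-C bond and six C-H bonds.
        "connections": [[0, 1], [0, 2], [0, 3], [0, 4], [1, 5], [1, 6], [1, 7]],
        "orders": [1, 1, 1, 1, 1, 1, 1],
    },
    "descriptors": {"mass": 30.07},
}

text = json.dumps(ethane, indent=2)  # what an ethane.cjson file could hold
restored = json.loads(text)
print(text)
```

Because the whole molecule is one nested object, structure, identifiers, and descriptors travel together and can be stored or transmitted without a separate schema layer.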

Page 16: Exploring Large Chemical Data Sets

ChemicalJSON in MongoDB

● Nearly identical to what is stored in a file
  ○ A few extra fields stored
    ■ 2D diagram (as PNG)
    ■ Heavy atom count (for substructure searching)
    ■ Binary fingerprints (for similarity searching)
    ■ InChIKey for indexing and as a unique key
    ■ Mongo's OID ("_id") field

● Trivial to write out to a .cjson file:

db.molecules.find({"name" : "ethanol"}, {"diagram" : 0, "heavyAtomCount" : 0, "fp2_fingerprint" : 0, "_id" : 0})
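The projection in the query above excludes the four database-only fields, leaving a document that can be written straight to a .cjson file. The sketch below mirrors that projection on a plain Python dictionary so it runs without a database. The stored record is a made-up stand-in, and the "formula" field and placeholder "_id" value are assumptions; the four excluded field names come from the query itself.

```python
import json

# Field names taken from the projection in the slide's query.
DATABASE_ONLY_FIELDS = ("diagram", "heavyAtomCount", "fp2_fingerprint", "_id")


def to_cjson(document):
    """Return JSON text for `document` without the database-only fields."""
    exported = {
        key: value
        for key, value in document.items()
        if key not in DATABASE_ONLY_FIELDS
    }
    return json.dumps(exported, indent=2)


# Stand-in for a stored record; the binary values are placeholders.
stored = {
    "_id": "placeholder-object-id",
    "name": "ethanol",
    "formula": "C2H6O",
    "heavyAtomCount": 3,
    "diagram": b"\x89PNG",
    "fp2_fingerprint": b"\x00\x01",
}
cjson_text = to_cjson(stored)
print(cjson_text)
```

Dropping the binary fields before serializing also sidesteps the fact that raw bytes are not valid JSON values.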

Page 17: Exploring Large Chemical Data Sets

Open Chemistry with ParaViewWeb

● Uses ParaView's client-server architecture

● Interactive 3D rendering

● Runs in any modern web browser

URL: http://paraviewweb.kitware.com/OpenChemistry/

Page 18: Exploring Large Chemical Data Sets

Open Chemistry with ParaViewWeb

ChemData

Page 19: Exploring Large Chemical Data Sets

RPC / Avogadro Integration

● Uses JSON-RPC to communicate with other applications (most notably Avogadro)

● Visualize data directly from the database

● Uses ChemicalJSON to represent molecular structures and transfer molecular information
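A JSON-RPC exchange of this kind can be sketched as follows. The example assumes the JSON-RPC 2.0 message envelope; the slides do not say which protocol version is used, and the method name "loadMolecule" and its parameters are hypothetical, not part of any documented Avogadro or ChemData interface.

```python
import itertools
import json

_request_ids = itertools.count(1)


def make_request(method, params):
    """Serialize a JSON-RPC 2.0 request with a fresh numeric id."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    })


def parse_response(text):
    """Return the result of a JSON-RPC 2.0 response, or raise on error."""
    message = json.loads(text)
    if "error" in message:
        raise RuntimeError(message["error"]["message"])
    return message["result"]


# Hypothetical call asking another application to display a molecule.
molecule = {"name": "ethane", "formula": "C2H6"}
request_text = make_request("loadMolecule", {"molecule": molecule})
print(request_text)

# A reply such a peer might send back.
reply_text = '{"jsonrpc": "2.0", "id": 1, "result": true}'
print(parse_response(reply_text))
```

Since the molecule is already a JSON object, it can be embedded in the request parameters directly, with no conversion step between storage and transport.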

Page 20: Exploring Large Chemical Data Sets

Future Directions

● Direct integration with third-party databases (PubChem, PDB, ...)

● Broader support for storing and analyzing computational job results
  ○ Linked with molecular structures
  ○ Direct from CML or converted/parsed

● Plugins to facilitate extension
  ○ Descriptors
  ○ Visualization
  ○ Chemical file input/output

● Scaling studies, working with multiple data servers and terabytes of data

Page 21: Exploring Large Chemical Data Sets

Comments/Questions?

Home Page: http://wiki.openchemistry.org/ChemData

Source Code: https://github.com/OpenChemistry/chemdata

ParaViewWeb Demo: http://paraviewweb.kitware.com/OpenChemistry