![Page 1: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/1.jpg)
Exploring Large Chemical Data Sets
Interactive Analysis and Visualization
Kyle Lutz and Marcus D. Hanwell
August 21, 2012
Skolnik Symposium
![Page 2: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/2.jpg)
Overview
● An open-source, cross-platform cheminformatics tool
● A general-purpose tool for chemical data exploration and analysis
● Interactive, editable and queryable database of chemical data on the desktop
● Part of the Open Chemistry application suite (Avogadro and MoleQueue)
● Leverages several open-source projects: Qt, VTK, Chemkit, Open Babel, MongoDB
![Page 3: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/3.jpg)
Architecture
● Native, cross-platform C++ application built with Qt
● Stores chemical data in a NoSQL MongoDB database
● Uses VTK for 2D and 3D data set visualization
![Page 4: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/4.jpg)
Main Window
![Page 5: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/5.jpg)
Molecule Details
![Page 6: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/6.jpg)
Queries
Supports different queries:
● Name
● Formula
● InChI
● InChIKey
● Structure and Substructure
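The exact-match query types listed above map naturally onto MongoDB filter documents. The following is a minimal sketch, not the tool's actual code: the helper `build_filter` is hypothetical, and only the `name` field appears in the slides, so the other field names are assumptions. Structure and substructure queries need a cheminformatics toolkit and are not covered here.

```python
# Hypothetical mapping from query type to document field. Only "name"
# appears in the slides; the other field names are assumptions.
FIELD_FOR_QUERY = {
    "name": "name",
    "formula": "formula",
    "inchi": "inchi",
    "inchikey": "inchikey",
}


def build_filter(query_type, value):
    """Return a MongoDB-style filter document for an exact-match query."""
    key = query_type.lower()
    if key not in FIELD_FOR_QUERY:
        raise ValueError(f"unsupported query type: {query_type!r}")
    return {FIELD_FOR_QUERY[key]: value}
```

With a MongoDB driver such as pymongo, the returned dictionary could be passed as the first argument of `find` on the molecules collection.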
![Page 7: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/7.jpg)
Similarity Searching
![Page 8: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/8.jpg)
Charts and Plots
Histogram of logP
Scatter Plot of Polar Surface Area (TPSA) against Volume (VABC)
![Page 9: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/9.jpg)
Multidimensional Analysis
● Provide tools for viewing and analyzing large amounts of data with multiple dimensions
  ○ Scatter Plot Matrix
  ○ Parallel Coordinates
  ○ K-Means Clustering
● Interactive charts supporting selection
● Easy to add new chemical descriptors
![Page 10: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/10.jpg)
Scatter Plot Matrix
Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
![Page 11: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/11.jpg)
Parallel Coordinates
Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
![Page 12: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/12.jpg)
K-Means Clustering
● ~30 numeric molecular descriptors
● 1D, 2D, and 3D visualization
● Selection and extraction of molecules from clusters
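For readers unfamiliar with the technique, k-means over descriptor vectors can be sketched with Lloyd's algorithm. This is an illustrative, standard-library-only sketch under the assumption that each molecule is represented by a list of numeric descriptor values; it is not the tool's implementation, which the slides do not describe.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (lists of floats) with Lloyd's algorithm.

    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

Selecting the molecules of one cluster then amounts to filtering by label, for example `[m for m, lab in zip(molecules, labels) if lab == 0]`, where `molecules` is a hypothetical list parallel to `points`.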
![Page 13: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/13.jpg)
Similarity Visualization
● Similarity Clustering
● Calculated from fingerprint similarity or structural similarity
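Fingerprint similarity is commonly measured with the Tanimoto coefficient; the slides do not say which measure the tool uses, so treat the following as a sketch under that assumption. Fingerprints are modeled here as Python integers used as bit sets, and `similarity_edges` is a hypothetical helper that keeps the pairs at or above a chosen threshold, which is one way to build a similarity graph.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both
    total = bin(fp_a | fp_b).count("1")   # bits set in either
    return common / total if total else 0.0


def similarity_edges(fingerprints, threshold):
    """Return (i, j) index pairs whose similarity is at least `threshold`."""
    edges = []
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                edges.append((i, j))
    return edges
```

Raising the threshold removes edges, so the graph breaks into smaller and tighter groups of similar molecules.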
![Page 14: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/14.jpg)
Similarity Visualization
Panel labels: 30%, 45%, 60%
![Page 15: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/15.jpg)
ChemicalJSON
Specification available at: http://wiki.openchemistry.org/Chemical_JSON
Example: ethane.cjson
● JSON (JavaScript Object Notation) is a "lightweight data-interchange format"
● Store molecular structure, geometry, identifiers, and descriptors all as a single JSON object
● Benefits:
  ○ More compact than XML/CML
  ○ Native language of MongoDB and JSON-RPC
  ○ Easily converted to a binary representation (BSON)
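To make the single-object idea concrete, here is a sketch of ethane held as one JSON object and round-tripped through text. The key names and nesting are illustrative assumptions loosely modeled on ChemicalJSON, not a copy of the specification; consult the specification for the authoritative layout. Geometry is omitted rather than invented.

```python
import json

# Illustrative layout only; key names are assumptions, not the specification.
ethane = {
    "name": "ethane",
    "inchi": "InChI=1S/C2H6/c1-2/h1-2H3",
    "formula": "C2H6",
    # Atomic numbers: two carbons followed by six hydrogens.
    "atoms": {"elements": {"number": [6, 6, 1, 1, 1, 1, 1, 1]}},
    "bonds": {
        # Flat list of atom-index pairs: C-C, then three C-H bonds per carbon.
        "connections": {"index": [0, 1, 0, 2, 0, 3, 0, 4, 1, 5, 1, 6, 1, 7]},
        "order": [1, 1, 1, 1, 1, 1, 1],
    },
}

text = json.dumps(ethane, indent=2)  # what a .cjson file could contain
restored = json.loads(text)          # parsing returns the same structure
```

Because the object is plain JSON, the same dictionary can be written to a file, sent over JSON-RPC, or stored as a MongoDB document without a separate conversion layer.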
![Page 16: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/16.jpg)
ChemicalJSON in MongoDB
● Nearly identical to what is stored in a file
  ○ A few extra fields stored
    ■ 2D diagram (as PNG)
    ■ Heavy atom count (for substructure searching)
    ■ Binary fingerprints (for similarity searching)
    ■ InChIKey for indexing and as a unique key
    ■ Mongo's OID ("_id") field
● Trivial to write out to a .cjson file:
  db.molecules.find({"name" : "ethanol"}, {"diagram" : 0, "heavyAtomCount" : 0, "fp2_fingerprint" : 0, "_id" : 0})
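The same export can be sketched client-side: given a document already fetched from the database, drop the database-only fields and serialize the rest. The field names come from the projection in the query above; the helper `to_cjson_text` is hypothetical and is not part of the tool.

```python
import json

# Database-only fields, as named in the projection of the query above.
DATABASE_ONLY_FIELDS = ("diagram", "heavyAtomCount", "fp2_fingerprint", "_id")


def to_cjson_text(document):
    """Serialize a stored molecule document without its database-only fields."""
    exported = {
        key: value
        for key, value in document.items()
        if key not in DATABASE_ONLY_FIELDS
    }
    return json.dumps(exported, indent=2, sort_keys=True)
```

Stripping these fields before serializing also avoids values that plain JSON cannot represent, such as binary image data or a driver-specific object ID.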
![Page 17: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/17.jpg)
Open Chemistry with ParaViewWeb
● Uses ParaView's client-server architecture
● Interactive 3D rendering
● Runs in any modern web browser
URL: http://paraviewweb.kitware.com/OpenChemistry/
![Page 18: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/18.jpg)
Open Chemistry with ParaViewWeb
ChemData
![Page 19: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/19.jpg)
RPC / Avogadro Integration
● Uses JSON-RPC to communicate with other applications (most notably Avogadro)
● Visualize data directly from the database
● Uses ChemicalJSON to represent molecular structures and transfer molecular information
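A JSON-RPC exchange of this kind can be sketched as follows. The message envelope follows JSON-RPC 2.0, which is an assumption because the slides do not state the protocol version. The method name `loadMolecule` and its parameter layout are hypothetical and do not describe Avogadro's or ChemData's real API.

```python
import itertools
import json

_request_ids = itertools.count(1)  # unique id per request, starting at 1


def make_request(method, params):
    """Build a JSON-RPC 2.0 request string."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    })


def parse_response(text):
    """Return the result of a JSON-RPC 2.0 response, or raise on an error."""
    message = json.loads(text)
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown error"))
    return message["result"]
```

A caller would embed the molecule's ChemicalJSON object in `params`, for example `make_request("loadMolecule", {"cjson": molecule})`, and send the resulting string over whatever transport the two applications share.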
![Page 20: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/20.jpg)
Future Directions
● Direct integration with 3rd party databases (PubChem, PDB, ...)
● Broader support for storing and analyzing computational job results
  ○ Linked with molecular structures
  ○ Direct from CML or converted/parsed
● Plugins to facilitate extension
  ○ Descriptors
  ○ Visualization
  ○ Chemical file input/output
● Scaling studies, working with multiple data servers and terabytes of data
![Page 21: Exploring Large Chemical Data Sets](https://reader035.vdocuments.us/reader035/viewer/2022081720/5583d130d8b42a6b638b4f85/html5/thumbnails/21.jpg)
Comments/Questions?
Home Page: http://wiki.openchemistry.org/ChemData
Source Code: https://github.com/OpenChemistry/chemdata
ParaViewWeb Demo: http://paraviewweb.kitware.com/OpenChemistry