Gregory Landrum NIBR IT Novartis Institutes for BioMedical Research, Basel

5th KNIME Users Group Meeting

Zurich, 2 February 2012

KNIME in NIBR: Stories from Industry

Basel, Switzerland

KNIME in NIBR

§  Infrastructure

§  Node development
   •  Open-source & in-house
   •  Sponsored

§  Examples

Infrastructure

§  Enterprise servers + cluster integration running in Cambridge, Basel

§  Standard releases for Windows, Linux, Mac

§  Nightly builds for users comfortable on the bleeding/leading edge

Node development : open source

§  Chemistry nodes based on the RDKit
   •  open-source cheminformatics toolkit
   •  useable from C++, Python, Java
   •  NIBR scientists/developers actively participate
   •  www.rdkit.org

§  Standard cheminformatics tasks + some nice extras

§  Developed both in-house and together with knime.com

Node development : in house

§  Connections to internal data sources

§  Wrappers around in-house developed algorithms

§  Connection to our web service framework for cheminformatics services

Generic CIx service node

Sponsored node development

§  Modifications to naïve Bayes nodes to support fingerprints

§  Fingerprint naïve Bayes supporting unbalanced datasets

§  Database schema browser

§  Improvements to Python integration

§  Improvements to database connector, readers

§  Ensemble tree classifier (in progress)

Case studies

Combining databases

§  Question: what kind of activity might I expect to see for a given compound?

§  Do a similarity search in our database of internal compounds

§  Look up assays where those compounds have been tested

§  More browsing of those results: where are those neighbors most active?

p(Activity) > 8

§  More browsing of those results: show me the most active neighbors
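The lookup chain in this case study (neighbors from a similarity search, then their assay results, then a filter on activity) can be sketched in plain Python outside KNIME. This is only an illustration under stated assumptions, not the NIBR workflow: the compound IDs, assay names, activity values and the function name `most_active_assays` are all hypothetical, and the two in-memory structures merely stand in for the compound and assay databases. The threshold of 8 mirrors the "p(Activity) > 8" filter shown on the slide.

```python
from collections import defaultdict

# Hypothetical output of a similarity search: neighbor compound IDs.
neighbors = ["CPD-2", "CPD-3", "CPD-5"]

# Hypothetical assay table: (compound_id, assay_name, p_activity).
assay_results = [
    ("CPD-1", "kinase_A", 6.1),
    ("CPD-2", "kinase_A", 8.4),
    ("CPD-2", "gpcr_B", 5.2),
    ("CPD-3", "kinase_A", 8.9),
    ("CPD-3", "protease_C", 7.0),
    ("CPD-5", "gpcr_B", 8.2),
]


def most_active_assays(neighbor_ids, results, threshold=8.0):
    """Group the neighbors that exceed the activity threshold by assay."""
    wanted = set(neighbor_ids)
    hits = defaultdict(list)
    for compound_id, assay, p_activity in results:
        if compound_id in wanted and p_activity > threshold:
            hits[assay].append((compound_id, p_activity))
    # Assays with the most active neighbors come first; ties sort by name.
    return sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))


ranked = most_active_assays(neighbors, assay_results)
for assay, compounds in ranked:
    print(assay, compounds)
```

Sorting by the number of active neighbors answers "where are those neighbors most active?"; the per-assay lists answer "show me the most active neighbors".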

Parallel virtual screening example

§  Goal: find some interesting compounds to be screened for a new project

§  2D similarity searches across two databases:
   •  NIBR powder archive
   •  Catalogs from trusted vendors

§  About 7 million compounds total.

§  Use several different fingerprints

Finton Sirockin (GDC/CADD)

The basic process

§  Generate fingerprints for database and queries

§  Calculate similarities with the Erlwood Fingerprint Similarity node

§  Sort, filter, standardize

§  Report
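The four steps above can be illustrated with a small standalone Python sketch. It assumes fingerprints are stored as sets of on-bit indices and uses the Tanimoto coefficient, a common choice for 2D similarity; the slides do not say which measure was configured in the Erlwood node, so treat that as an assumption. The function names, the toy bit sets and the 0.5 cutoff are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for fingerprints stored as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_neighbors(query_fp, database, min_similarity=0.5):
    """Score every database entry, drop weak matches, sort best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in database.items()]
    kept = [(name, sim) for name, sim in scored if sim >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints (hypothetical bit sets, not real molecules).
database = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 9},
    "mol_c": {7, 8, 9},
}
query = {1, 2, 3, 4}

report = rank_neighbors(query, database)
for name, sim in report:
    print(f"{name}\t{sim:.2f}")
```

Here `mol_c` shares no bits with the query and is filtered out, so the report contains only `mol_a` and `mol_b`.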

Combining the pieces

• Workflow is run for each query

• Fingerprints calculated for each type of search
   -  600 – 11 000 s
   -  Needs to be calculated only once, even for n queries
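The "calculated only once, even for n queries" point is essentially a caching pattern: the expensive database fingerprints are keyed by fingerprint type and reused by every query. A rough sketch follows, assuming an in-memory cache via `functools.lru_cache`; the toy "fingerprint" (character n-grams), the `DATABASE` tuple and all function names are hypothetical placeholders, not what the workflow actually computes.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive database pass runs

DATABASE = ("CCO", "CCN", "c1ccccc1")  # hypothetical stand-in for the archive


def toy_fingerprint(text, kind):
    """Placeholder 'fingerprint': character n-grams, with n chosen by kind."""
    n = {"short": 1, "long": 2}[kind]
    return frozenset(text[i:i + n] for i in range(len(text) - n + 1))


@lru_cache(maxsize=None)
def database_fingerprints(kind):
    """Expensive step: runs once per fingerprint type, then comes from cache."""
    calls["count"] += 1
    return tuple(toy_fingerprint(entry, kind) for entry in DATABASE)


def search(query, kind):
    """Per-query step: only the query fingerprint is computed fresh."""
    query_fp = toy_fingerprint(query, kind)
    db_fps = database_fingerprints(kind)
    return [len(query_fp & fp) / len(query_fp | fp) for fp in db_fps]


queries = ["CCO", "CCC", "c1ccncc1"]
results = {(q, kind): search(q, kind) for q in queries for kind in ("short", "long")}
print(calls["count"])
```

With three queries and two fingerprint types, six searches run but the database pass happens only twice, once per type.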

Cluster usage reporting

§  Present a dashboard with a comprehensive view of current and historical usage of our HPC cluster infrastructure

§  Three phases of processing:
   •  Input from raw SGE files off of the clusters at each site
   •  Steps A – C: data pre-processing, filtering & date-time object conversion
      -  All logs are gathered into a single file kept in RAM
      -  Use of Java nodes to convert Unix time to KNIME date objects
      -  Bash nodes for awk manipulations, which are faster natively in Linux
   •  Steps D – I: execute concurrently
      -  KNIME Statistics and grouping are heavily used
      -  Step H spawns cluster jobs to gather user usage statistics

§  Present summarized and aggregated data using Spotfire
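The pre-processing and grouping phases can be illustrated with a short Python sketch: convert Unix timestamps to date-time objects, then aggregate wall-clock hours per user and day. The three-column record layout (owner, start, end) is a deliberately simplified, hypothetical stand-in; real SGE accounting files carry many more fields, and the actual workflow does this with KNIME nodes rather than Python.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical, simplified records: owner, start and end in Unix seconds.
raw_log = """\
alice 1328140800 1328144400
bob 1328140800 1328142600
alice 1328227200 1328234400
"""


def parse_records(text):
    """Pre-processing: split lines and convert Unix time to datetime objects."""
    for line in text.splitlines():
        owner, start, end = line.split()
        yield (
            owner,
            datetime.fromtimestamp(int(start), tz=timezone.utc),
            datetime.fromtimestamp(int(end), tz=timezone.utc),
        )


def usage_hours_by_user_and_day(records):
    """Grouping: total wall-clock hours per (user, UTC start date)."""
    totals = defaultdict(float)
    for owner, start, end in records:
        key = (owner, start.date().isoformat())
        totals[key] += (end - start).total_seconds() / 3600
    return dict(totals)


summary = usage_hours_by_user_and_day(parse_records(raw_log))
for (owner, day), hours in sorted(summary.items()):
    print(f"{owner},{day},{hours:.2f}")
```

The printed lines are already in a CSV-like shape, similar in spirit to the summarized file that a dashboard tool would read.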

Mike Derby (NIBR IT) Varun Shivashankar (NIBR IT)

The workflow

•  Usage data input file: original logs, 2 GB – 4 GB in size x 4 clusters

•  Resulting data file of summarized data: user_usage_DUS.csv (1.9M)

The complexity

The report: historical data

The dashboard

Written out to a UNC path, read every few minutes by Spotfire Server. Data is generated either from scripts or by KNIME running headless.

Predicting which target a molecule will hit

§  Goal: build a model to predict which of a set of targets a molecule is most likely to hit

§  Method: RDKit atom-pair fingerprints and a new KNIME learner that builds ensembles of truncated decision trees (sponsored development with knime.com)

§  Validation data set: active molecules from 50 different ChEMBL assays¹

¹ Heikamp, K. & Bajorath, J. Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets. J. Chem. Inf. Model. 51, 1831-1839 (2011).

Predicting which target a molecule will hit

§  11561 data points, 50 classes

§  50 trees, random descriptor selection

About that scaling…

Predicting which target a molecule will hit

§  out-of-bag prediction error: 5.8%

§  mean error from cross validation: 4.2%
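To make "ensemble of truncated trees with random descriptor selection" and "out-of-bag prediction error" concrete, here is a toy pure-Python sketch. It is not the sponsored KNIME learner: depth-1 stumps stand in for truncated trees, the data are six synthetic bits rather than atom-pair fingerprints, and every name and parameter (`train_stump`, `train_ensemble`, `n_candidates`, the seed) is a hypothetical choice made for illustration.

```python
import random
from collections import Counter


def majority(labels, default=None):
    """Most common label, or default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def train_stump(rows, labels, feature_ids):
    """Depth-1 tree: pick the candidate bit whose split misclassifies least."""
    fallback = majority(labels)
    best = None
    for f in feature_ids:
        on = [y for x, y in zip(rows, labels) if x[f]]
        off = [y for x, y in zip(rows, labels) if not x[f]]
        on_label, off_label = majority(on, fallback), majority(off, fallback)
        errors = sum(y != on_label for y in on) + sum(y != off_label for y in off)
        if best is None or errors < best[0]:
            best = (errors, f, on_label, off_label)
    _, f, on_label, off_label = best
    return lambda x: on_label if x[f] else off_label


def train_ensemble(rows, labels, n_trees=50, n_candidates=2, seed=0):
    """Bagged stumps with random descriptor selection; returns trees and OOB error."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees, oob_votes = [], [[] for _ in rows]
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        feats = rng.sample(range(n_features), n_candidates)  # random descriptors
        tree = train_stump([rows[i] for i in sample], [labels[i] for i in sample], feats)
        trees.append(tree)
        for i in set(range(n)) - set(sample):                # rows this tree never saw
            oob_votes[i].append(tree(rows[i]))
    # Each row is judged only by the trees that did not train on it.
    scored = [(majority(v), y) for v, y in zip(oob_votes, labels) if v]
    oob_error = sum(p != y for p, y in scored) / len(scored) if scored else float("nan")
    return trees, oob_error


def predict(trees, x):
    """Majority vote over all trees."""
    return majority([tree(x) for tree in trees])


# Toy data: bits 0-1 mark class A, bits 2-3 mark class B, bits 4-5 are noise.
rows, labels = [], []
for i in range(20):
    is_a = i < 10
    rows.append([int(is_a), int(is_a), int(not is_a), int(not is_a), i % 2, (i // 2) % 2])
    labels.append("A" if is_a else "B")

trees, oob_error = train_ensemble(rows, labels)
print(f"OOB error: {oob_error:.3f}")
print(predict(trees, [1, 1, 0, 0, 0, 1]))
```

The out-of-bag estimate comes for free during training, since each bootstrap sample leaves out roughly a third of the rows; cross validation, by contrast, needs separate training runs.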

Predicting which target a molecule will hit

§  mistakes tend to be in families
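One way to check a claim like this is to tabulate the confusion matrix and list its largest off-diagonal cells. The sketch below uses invented labels whose prefix encodes a hypothetical target family; none of the names or counts come from the ChEMBL validation set.

```python
from collections import Counter

# Hypothetical true and predicted target labels, for illustration only.
y_true = ["kinase_1", "kinase_1", "kinase_2", "kinase_2", "gpcr_1", "gpcr_1", "gpcr_2"]
y_pred = ["kinase_1", "kinase_2", "kinase_2", "kinase_1", "gpcr_1", "gpcr_2", "gpcr_2"]


def confusion_counts(true_labels, predicted_labels):
    """Sparse confusion matrix: (true, predicted) -> count."""
    return Counter(zip(true_labels, predicted_labels))


def top_confusions(matrix, k=3):
    """Off-diagonal cells with the highest counts, i.e. the commonest mistakes."""
    mistakes = [(pair, n) for pair, n in matrix.items() if pair[0] != pair[1]]
    return sorted(mistakes, key=lambda item: (-item[1], item[0]))[:k]


matrix = confusion_counts(y_true, y_pred)
for (true_label, predicted_label), count in top_confusions(matrix):
    print(f"{true_label} -> {predicted_label}: {count}")
```

In this toy data every mistake stays within a family prefix (kinase to kinase, GPCR to GPCR), which is the pattern the slide describes.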

Drilling into the confusion matrix

Acknowledgements

§  NIBR
   •  John Davies (CPC)
   •  Richard Lewis (GDC)
   •  Steve Litster (NIBR IT)
   •  Andy Palmer (NIBR IT)
   •  Patrick Warren (NIBR IT)
   •  Case studies
      -  Finton Sirockin (GDC)
      -  Mike Derby (NIBR IT)
      -  Varun Shivashankar (NIBR IT)
      -  John Davies (CPC)
   •  Node development
      -  Manuel Schwarze (NIBR IT)
      -  Dillip Kumar Mohanty (NIBR IT)
      -  Sudip Ghosh (NIBR IT)
   •  Marc Litherland (NIBR IT)

§  knime.com
   •  Michael Berthold
   •  Bernd Wiswedel
   •  Thorsten Meinl
   •  Peter Ohl

§  Simon Richards (Lilly)

Teach • Discover • Treat

the power of collaborative efforts

Join the Teach-Discover-Treat initiative: participate in our symposium* and compete on one or more challenges!

*ACS Spring Meeting, March 25th, 1:30pm to 5:00pm, San Diego Convention Center, Room 26A

Goal: Provide high quality computational chemistry tutorials that impact education and drug discovery for neglected diseases

§  Requirements: use freely available software tools; datasets will be provided with a focus on targets for neglected diseases

§  Criteria to judge: quality of the model (statistical measures), clarity of the tutorial (suitable for undergraduate course), innovative application of computational technique(s)

§  Awards: travel awards to cover travel expenses for presenting work at COMP symposium

§  Presentation of Awardees at ACS Spring 2013 meeting (New Orleans)

More information and access to data sets coming in March. Bookmark www.teach-discover-treat.org
