migrating the emolecules chemical-search engine to chemaxon€¦ · rendering powered by marvin js....

25
Migrating the e Molecules Chemical - Search Engine to ChemAxon Craig A. James CTO, eMolecules, Inc.

Upload: others

Post on 06-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Migrating the eMolecules

Chemical-Search Engine

to ChemAxon

Craig A. James

CTO, eMolecules, Inc.

Page 2: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Search and Order Chemical Compounds

• 10M products:• 2M Building-Block

Compounds

• 8M Screening Compounds

• 600K Antibodies

• No “virtual” compounds

• Quality Data:• Supplier relationships

• Fast catalog updates

• Data normalization,

cleanup

• Data curation/QA

• Supplier “tiers”

What is eMolecules?

Page 3: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

About eMolecules: Searching

• Search

• Substructure

• Similarity

• Exact

• Refine

• Browse

• Filter

• Sort

• Save/Reopen

• Collaborate

• Import list

• Export list

Page 4: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

About eMolecules: Shopping Cart

• Accurate

• Up-to-date

prices

• Accurate lead

times

• Curated data

• Timely

• Suppliers’ on-

time delivery

monitored daily

Page 5: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

About eMolecules: Custom Web Sites

• 100 private web sites:• Pharma, biotech and

biology customers

• Customized: customer’s

preferred suppliers and

“skin”

• Secure: access

restrictions, verified

logins

Page 6: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

• Web Services allow

our partner to present

eMolecules structure

and catalog search

• JAGGAER (SciQuest)

Enterprise Reagent

Manager (ERM)

federates inventory

search with

eMolecules to simplify

sourcing and

management of

materials for research

About eMolecules: Partnering

JAGGAER ERM now offers structure drawing and

rendering powered by Marvin JS

Page 7: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

eMolecules’ Cheminformatics Engine

• eMolecules’ Fast-Search Engine

• Fast: most queries return in 0.1 to 2 seconds

• Stateless: answer query then forget user

◦ Each subsequent query (e.g. “page 2”) is a new

search.

• Efficient pagination: No overhead

◦ “Get rows 1-10 and stop”

◦ “Get rows 11-20 and stop” … without re-searching

for rows 1-10.

Page 8: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

eMolecules’ Cheminformatics Engine

• eMolecules’ Power-Search: Hitlists

• Long-running server-side searches (launched

by browser, run to completion)

• Progress reports

• Interrupt/Cancel

• Refinement (“Filtering”) to narrow results

• Multiple stored hitlists

• List operations (merge, filter, union,

intersection, etc.)

Page 9: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

2016: Replace Home-Grown Searching

• Why replace eMolecules’ home-grown

cheminformatics?

• Commercial products (e.g. ChemAxon) are

superior

• Available as “cartridge” in Postgres open-

source relational database

• Cost of maintaining compatibility with

OpenBabel open-source software project

Page 10: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Relational Databases and the Web

• Web: Stateless, fast, lightweight• Lots of fast queries, pagination, load spikes

• Guest users with no stake in results

• Abusive robots, industrial theft, criminals

• RDB: Stateful, unpredictable, overhead• Never “instantaneous” like Google or Bing – queries can

take minutes, even longer

• Poor pagination

• No refinement of search results

• No progress report: Will my search be done in 10 seconds or 2 hours??

• Hard to interrupt/cancel a search

• Result: User disappointment

Page 11: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Background: What RDBs do well

• “Does Enamine sell CAS# 5398-77-6?”

=> explain analyze select compound_id, cas_number from sample

where cas_number = '5398-77-6' and supplier_id = 681;

-> BitmapAnd

cost est: 555.00

actual time: 8.99

-> Bitmap Index Scan on i_sample_casno_8947

cost est: 0.00 to 5.77 rows=178

actual time: 0.023 rows=38

-> Bitmap Index Scan on i_sample_supplier_id_8947

cost est: 0.00 to 548.98 rows=20339

actual time: 8.96 rows=96626

Execution time: 13.056 ms

Page 12: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Background: What RDBs do well

• RDBs handle this query really well

• Statistics on each column of numbers are

highly predictive

• Good prediction Good execution plan

• This doesn’t work for chemical queries

• Statistics on chemical columns are poor

predictors

• Poor prediction Poor execution plan

Page 13: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Problem: Instant Results

• Searching millions of compounds

• Our test cases: “Find 1st page of results”

Selective Query: 3 hits“Stupid” Query: 1.3M hits

Page 14: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Selective query works well

=> explain analyze select count(1) from registry.smiles_indexed

where smiles |>| 'CN1CCN(CC1)S(=O)(=O)c2ccc(OCC)c(C(=O)O)c2' limit 10;

Limit

cost est: 62247.90 to 62247.91

actual time: 178.05

-> Aggregate

cost est: 62247.90 to 62247.91

actual time: 178.04

-> Index Scan using i_smiles_indexed_smiles on smiles_indexed

cost est: 102.60 to 60230.04 rows=807144

actual time: 177.577 rows=3

Page 15: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

“Stupid” query is a disaster

=> explain analyze select count(1) from registry.smiles_indexed

where smiles |>| 'c1ccccc1' limit 10;

Limit

cost est: 62247.90 to 62247.91

actual time: 30787.81

-> Aggregate

cost est: 62247.90 to 62247.91

actual time: 30787.81

-> Index Scan using i_smiles_indexed_smiles on smiles_indexed

cost est: 102.60 to 60230.04 rows=807144

actual time: 24753.99 rows=1318621

Page 16: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Problem: Instant Results

• Ordinary SQL strategy won’t work

• Optimizer predictions are wrong for chemistry

• Query times extremely variable

• Even if you give hints, restructure your SQL,

and get it to work, you’re still in trouble

◦ RDB optimizers change/improve with every

software release

◦ Query plans can change unexpectedly as DB grows

◦ What works today may fail tomorrow!

Page 17: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Historical Note: eMolecules Technique

5398-77-6 MFCD00216485Dibenzofuran

Chemical

Query?

eMolecules

Chemical

Search

Postgres

Database

Yes

No

• How did eMolecules’ system handle this?

Page 18: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Historical Note: eMolecules Technique

• The eMolecules Chemical Search Engine

• Not a general-purpose database

• No query planning

• Exactly one query: Find the next N rows

• Dynamic optimization: search speeds up as it

proceeds

• Execution state (“Where did the last search

stop?”) encoded in a few bytes that can be

embedded in a URL

Page 19: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Solution: Postgres Partitioning

BB_molecules(parent table – empty)

BB_molecules_01(child table – 5%)

BB_molecules_20(child table – 5%)…

Query entire data set

(10 million compounds)

Query subsets

(500,000 compounds)

Page 20: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Solution: Postgres Partitioning

• Benefits of table partitions:

• Initial web query applied to just one partition

• Only query full data set (all partitions) if user

is interested in full result set

• User requests for full result set are only a

small fraction of actual cases

Page 21: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Partitioning Performance

• Advantage: “Stupid” query improved 95%

• Disadvantage: Selective queries slowed

• Very specific query: 3 hits• Search 2M: 85 msec

• Search 2M in 20 parts: 128 msec

• 43 msec slowdown (insignificant)

• Benzene query: 1st ten rows

from 1.3M hits:• Search full 2M: 28 seconds

• Search 5%: 1.0 seconds

• 30x speedup

[NO NAME]

CHEMW2 03241715442D

Created with ChemWriter - http://chemwriter.com

22 23 0 0 0 0 999 V2000

-7.7482 0.1596 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0

-7.0337 -0.2529 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-6.3192 0.1596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-6.3192 0.9846 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0

-7.0337 1.3972 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-7.7482 0.9846 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-5.6047 1.3972 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0

-4.8902 0.9846 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-6.0172 2.1117 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0

-5.1922 2.1117 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0

-4.8902 0.1596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-4.1757 -0.2529 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-3.4612 0.1596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-3.4612 0.9846 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-4.1757 1.3972 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-2.7467 -0.2529 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0

-2.7467 -1.0779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-2.0322 -1.4905 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-4.1757 -1.0779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

-4.8902 -1.4905 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0

-3.4612 -1.4905 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0

-8.4627 -0.2529 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

1 2 1 0 0 0 0

2 3 1 0 0 0 0

3 4 1 0 0 0 0

4 5 1 0 0 0 0

5 6 1 0 0 0 0

6 1 1 0 0 0 0

4 7 1 0 0 0 0

7 8 1 0 0 0 0

7 9 2 0 0 0 0

7 10 2 0 0 0 0

8 11 2 0 0 0 0

11 12 1 0 0 0 0

12 13 2 0 0 0 0

13 14 1 0 0 0 0

14 15 2 0 0 0 0

15 8 1 0 0 0 0

13 16 1 0 0 0 0

16 17 1 0 0 0 0

17 18 1 0 0 0 0

12 19 1 0 0 0 0

19 20 1 0 0 0 0

19 21 2 0 0 0 0

1 22 1 0 0 0 0

M END

Page 22: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Power Search: eMolecules Hitlists

• Search results are stored in a “hitlist”

• Progress report during search (via partitioning)

• Persistent results (saved in database)

• Fast perusal of results (no further searching)

• Sorting, refining results

• Cancel a search

• A demo…

Page 23: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Implementation: RESTful JSON API

• Database layer separated from application

Cheminformatics

App Server

Web

App

Server

HTTP HTTP

RESTful

JSON

API

Postgres:

Supplier’s

Catalog Data

Postgres:

ChemAxon

Chemical Search

Page 24: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

Summary

• RDBs: poorly suited to web expectations

• Cheminformatics makes it even worse

• Solution: partitioning and hitlists

• Lightweight queries◦ “Instant” response

◦ Efficient, stateless pagination

• Power queries◦ Long-running queries with progress reports

◦ Refine, sort, and merge results

◦ Interrupt/cancel features

Page 25: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:

RESULTS …

Result: A Web-capable Relational

Cheminformatics Database

Questions?

Craig A. James

CTO, eMolecules, Inc.