migrating the emolecules chemical-search engine to chemaxon€¦ · rendering powered by marvin js....
TRANSCRIPT
Migrating the eMolecules
Chemical-Search Engine
to ChemAxon
Craig A. James
CTO, eMolecules, Inc.
Search and Order Chemical Compounds
• 10M products:• 2M Building-Block
Compounds
• 8M Screening Compounds
• 600K Antibodies
• No “virtual” compounds
• Quality Data:• Supplier relationships
• Fast catalog updates
• Data normalization,
cleanup
• Data curation/QA
• Supplier “tiers”
What is eMolecules?
About eMolecules: Searching
• Search
• Substructure
• Similarity
• Exact
• Refine
• Browse
• Filter
• Sort
• Save/Reopen
• Collaborate
• Import list
• Export list
About eMolecules: Shopping Cart
• Accurate
• Up-to-date
prices
• Accurate lead
times
• Curated data
• Timely
• Suppliers’ on-
time delivery
monitored daily
About eMolecules: Custom Web Sites
• 100 private web sites:• Pharma, biotech and
biology customers
• Customized: customer’s
preferred suppliers and
“skin”
• Secure: access
restrictions, verified
logins
• Web Services allow
our partner to present
eMolecules structure
and catalog search
• JAGGAER (SciQuest)
Enterprise Reagent
Manager (ERM)
federates inventory
search with
eMolecules to simplify
sourcing and
management of
materials for research
About eMolecules: Partnering
JAGGAER ERM now offers structure drawing and
rendering powered by Marvin JS
eMolecules’ Cheminformatics Engine
• eMolecules’ Fast-Search Engine
• Fast: most queries return in 0.1 to 2 seconds
• Stateless: answer query then forget user
◦ Each subsequent query (e.g. “page 2”) is a new
search.
• Efficient pagination: No overhead
◦ “Get rows 1-10 and stop”
◦ “Get rows 11-20 and stop” … without re-searching
for rows 1-10.
eMolecules’ Cheminformatics Engine
• eMolecules’ Power-Search: Hitlists
• Long-running server-side searches (launched
by browser, run to completion)
• Progress reports
• Interrupt/Cancel
• Refinement (“Filtering”) to narrow results
• Multiple stored hitlists
• List operations (merge, filter, union,
intersection, etc.)
2016: Replace Home-Grown Searching
• Why replace eMolecules’ home-grown
cheminformatics?
• Commercial products (e.g. ChemAxon) are
superior
• Available as “cartridge” in Postgres open-
source relational database
• Cost of maintaining compatibility with
OpenBabel open-source software project
Relational Databases and the Web
• Web: Stateless, fast, lightweight• Lots of fast queries, pagination, load spikes
• Guest users with no stake in results
• Abusive robots, industrial theft, criminals
• RDB: Stateful, unpredictable, overhead• Never “instantaneous” like Google or Bing – queries can
take minutes, even longer
• Poor pagination
• No refinement of search results
• No progress report: Will my search be done in 10 seconds or 2 hours??
• Hard to interrupt/cancel a search
• Result: User disappointment
Background: What RDBs do well
• “Does Enamine sell CAS# 5398-77-6?”
=> explain analyze select compound_id, cas_number from sample
where cas_number = '5398-77-6' and supplier_id = 681;
-> BitmapAnd
cost est: 555.00
actual time: 8.99
-> Bitmap Index Scan on i_sample_casno_8947
cost est: 0.00 to 5.77 rows=178
actual time: 0.023 rows=38
-> Bitmap Index Scan on i_sample_supplier_id_8947
cost est: 0.00 to 548.98 rows=20339
actual time: 8.96 rows=96626
Execution time: 13.056 ms
Background: What RDBs do well
• RDBs handle this query really well
• Statistics on each column of numbers are
highly predictive
• Good prediction Good execution plan
• This doesn’t work for chemical queries
• Statistics on chemical columns are poor
predictors
• Poor prediction Poor execution plan
Problem: Instant Results
• Searching millions of compounds
• Our test cases: “Find 1st page of results”
Selective Query: 3 hits“Stupid” Query: 1.3M hits
Selective query works well
=> explain analyze select count(1) from registry.smiles_indexed
where smiles |>| 'CN1CCN(CC1)S(=O)(=O)c2ccc(OCC)c(C(=O)O)c2' limit 10;
Limit
cost est: 62247.90 to 62247.91
actual time: 178.05
-> Aggregate
cost est: 62247.90 to 62247.91
actual time: 178.04
-> Index Scan using i_smiles_indexed_smiles on smiles_indexed
cost est: 102.60 to 60230.04 rows=807144
actual time: 177.577 rows=3
“Stupid” query is a disaster
=> explain analyze select count(1) from registry.smiles_indexed
where smiles |>| 'c1ccccc1' limit 10;
Limit
cost est: 62247.90 to 62247.91
actual time: 30787.81
-> Aggregate
cost est: 62247.90 to 62247.91
actual time: 30787.81
-> Index Scan using i_smiles_indexed_smiles on smiles_indexed
cost est: 102.60 to 60230.04 rows=807144
actual time: 24753.99 rows=1318621
Problem: Instant Results
• Ordinary SQL strategy won’t work
• Optimizer predictions are wrong for chemistry
• Query times extremely variable
• Even if you give hints, restructure your SQL,
and get it to work, you’re still in trouble
◦ RDB optimizers change/improve with every
software release
◦ Query plans can change unexpectedly as DB grows
◦ What works today may fail tomorrow!
Historical Note: eMolecules Technique
5398-77-6 MFCD00216485Dibenzofuran
Chemical
Query?
eMolecules
Chemical
Search
Postgres
Database
Yes
No
• How did eMolecules’ system handle this?
Historical Note: eMolecules Technique
• The eMolecules Chemical Search Engine
• Not a general-purpose database
• No query planning
• Exactly one query: Find the next N rows
• Dynamic optimization: search speeds up as it
proceeds
• Execution state (“Where did the last search
stop?”) encoded in a few bytes that can be
embedded in a URL
Solution: Postgres Partitioning
BB_molecules(parent table – empty)
BB_molecules_01(child table – 5%)
BB_molecules_20(child table – 5%)…
Query entire data set
(10 million compounds)
Query subsets
(500,000 compounds)
Solution: Postgres Partitioning
• Benefits of table partitions:
• Initial web query applied to just one partition
• Only query full data set (all partitions) if user
is interested in full result set
• User requests for full result set are only a
small fraction of actual cases
Partitioning Performance
• Advantage: “Stupid” query improved 95%
• Disadvantage: Selective queries slowed
• Very specific query: 3 hits• Search 2M: 85 msec
• Search 2M in 20 parts: 128 msec
• 43 msec slowdown (insignificant)
• Benzene query: 1st ten rows
from 1.3M hits:• Search full 2M: 28 seconds
• Search 5%: 1.0 seconds
• 30x speedup
[NO NAME]
CHEMW2 03241715442D
Created with ChemWriter - http://chemwriter.com
22 23 0 0 0 0 999 V2000
-7.7482 0.1596 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-7.0337 -0.2529 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.3192 0.1596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.3192 0.9846 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-7.0337 1.3972 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-7.7482 0.9846 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.6047 1.3972 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
-4.8902 0.9846 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.0172 2.1117 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-5.1922 2.1117 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-4.8902 0.1596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.1757 -0.2529 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.4612 0.1596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.4612 0.9846 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.1757 1.3972 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.7467 -0.2529 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-2.7467 -1.0779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.0322 -1.4905 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.1757 -1.0779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.8902 -1.4905 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-3.4612 -1.4905 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-8.4627 -0.2529 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
3 4 1 0 0 0 0
4 5 1 0 0 0 0
5 6 1 0 0 0 0
6 1 1 0 0 0 0
4 7 1 0 0 0 0
7 8 1 0 0 0 0
7 9 2 0 0 0 0
7 10 2 0 0 0 0
8 11 2 0 0 0 0
11 12 1 0 0 0 0
12 13 2 0 0 0 0
13 14 1 0 0 0 0
14 15 2 0 0 0 0
15 8 1 0 0 0 0
13 16 1 0 0 0 0
16 17 1 0 0 0 0
17 18 1 0 0 0 0
12 19 1 0 0 0 0
19 20 1 0 0 0 0
19 21 2 0 0 0 0
1 22 1 0 0 0 0
M END
Power Search: eMolecules Hitlists
• Search results are stored in a “hitlist”
• Progress report during search (via partitioning)
• Persistent results (saved in database)
• Fast perusal of results (no further searching)
• Sorting, refining results
• Cancel a search
• A demo…
Implementation: RESTful JSON API
• Database layer separated from application
Cheminformatics
App Server
Web
App
Server
HTTP HTTP
RESTful
JSON
API
Postgres:
Supplier’s
Catalog Data
Postgres:
ChemAxon
Chemical Search
Summary
• RDBs: poorly suited to web expectations
• Cheminformatics makes it even worse
• Solution: partitioning and hitlists
• Lightweight queries◦ “Instant” response
◦ Efficient, stateless pagination
• Power queries◦ Long-running queries with progress reports
◦ Refine, sort, and merge results
◦ Interrupt/cancel features
RESULTS …
Result: A Web-capable Relational
Cheminformatics Database
Questions?
Craig A. James
CTO, eMolecules, Inc.