Patent chemistry big bang: utilities for SMEs
TRANSCRIPT
1
www.guidetopharmacology.org
The open patent chemistry “big bang”: large opportunities for small enterprises
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh
ACS Mon, Mar 14 CINF: Division of Chemical Information, 79 SESSION: Chemical Information for Small Businesses & Startups 1:00 PM - 4:55 PM - Room 24C 4:25 PM - 4:50 PM
http://www.slideshare.net/cdsouthan/patent-chemisty-big-bang-utilities-for-smes
2
Abstract (will be skipped for presentation)
In 2012, after the first IBM open deposition of 2.5 million structures, few would have predicted that PubChem compounds that include patent-extracted submissions would approach 20 million by 2015 (PMID 26194581). The current major open patent chemistry feeds (in ascending size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and SureChEMBL. The comparative statistics of these sources will be presented, along with the arguments that the coverage probability of lead compound prior-art structures is now very high. The consequences are that the academic community and small companies can now patent-mine extensively in PubChem and SureChEMBL, possibly even without needing commercial sources to support their own filings. Other recent major enabling aspects for small institutions include a) the open availability of patent full-text for querying, b) a range of free tools for DIY chemistry extraction (PMID 23618056) and c) automatic bioentity mark-up in patent text (e.g. protein names) from the SureChEMBL/SciBite collaboration. Examples of DIY analysis of newly published patents will be shown. Even for small enterprises not filing directly, open patent chemistry presents a big expansion in accessible SAR space, and aspects of mining this will be exemplified. However, open chemistry extraction does bring in a variety of artefacts that add confounding structural “noise”. These include a) permutations of mixtures and chiral exemplifications, b) virtual structures, c) the fact that extractions from documents cannot directly indicate IP status and d) “common chemistry” swamping. These problems and some partial solutions using PubChem filters will be discussed.
3
Encouraging preface
4
Outline
• Balancing IP against bioactivity mining
• Source coverage for patent extraction
• Caveats with automated extraction
• The example of US9056843
• Source extraction comparisons
• DIY extraction
• Questions on open searching
• Conclusions
• References
5
IP vs SAR from open patent mining
IP assessment
• Essential source of prior art chemistry
• De facto adjunct to commercial sources
• Improved portals (EPO, WIPO, FPOL)
• SureChEMBL, TRP & BindingDB active
• PubChem content is chemistry from patents, not patented chemistry
• CNER brainless compared to expert IP-relevance selection
• Claim section extraction often weak
• Extracted artefacts confounding (e.g. mixtures & virtuals)
• Dense image tables still a coverage gap
• IBM and SCRIPDB static in PubChem
• Asian chemistry shortfall
• The “common chemistry” problem
• Patent blitzing for drug candidates
Bioactivity data mining
• Circa 5x more SAR than literature
• Patent families collapse to < 100K C07D primary documents
• Advanced query options in SureChEMBL
• Bulk synthesis extraction (NextMove)
• Valuable intersects with papers, authors and targets via ChEMBL
• Easy intersecting with DIY chemistry extraction from any document
• Obfuscation in example > assay data
• Challenge of judging scientific quality
• Only ~ 5 mil structures potentially linkable to bioactivity data
• Thus ~ 15 million have marginal utility
• CNER > structural multiplexing
6
Big chemistry: prior art statistics
March 2016 snapshots
• GDB-13: 907 million virtual structures (similarity search)
• Google InChIKey: 120+? million (exact match search)
• EBI UniChem: 110.7 million, 27 sources (exact match search)
• CAS: 109 million substances (commercial, similarity search)
• PubChem: 89 million, 390 sources (similarity search)
• ChemSpider: 43 million, 510 sources (similarity search)
• SureChEMBL: 16.8 million (similarity search)
• GVKBio: 6.2 million (commercial bioactivity capture from patents and papers, similarity search)
7
History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP) manual extractions from patents and papers (now 4.3 mil, ~40% patents)
• 2011 - IBM phase 1 chemical named entity recognition (CNER) 2.5 mil
  • SLING Consortium EPO extraction 0.1 mil
• 2012 - SCRIPDB, CNER plus Complex Work Units (CWU) 4.0 mil
• 2013 - SureChem, CNER + image, 9.0 mil
• 2014 - BindingDB USPTO assay extraction (now 0.08 mil)
• 2015 - (CNER + images + CWU)
  • SureChEMBL 13.0 mil
  • IBM phase 2, 7.0 mil
  • NextMove Software 1.4 mil synthesis mapping
• 2016 - SureChEMBL 15.8 mil
  • CIDs from CNER extractions 19.1 mil (from 88.8 mil, 4th March)
  • Total patent chemistry with estimate from TRP ~ 20.5 mil
8
CNER patent sources vs. patent and paper curation: corroboration and divergence
[Venn diagram of three sets]
• IBM + SCRIPDB + SureChEMBL + NextMove = 19.01
• ChEMBL20 = 1.45
• Thomson Pharma = 4.3
• Segment counts shown in the diagram: 17.3, 0.18, 1.4, 2.5, 0.12, 0.25, 0.9
Counts are PubChem Compound Identifiers (CIDs) in millions
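The segment counts in a three-way comparison like this one can be derived from plain set operations on CID lists. The following is a minimal sketch, not the method used for the slide: the function name `venn3_segments` and the toy CID values are hypothetical, chosen only to show how the seven exclusive segments are computed.

```python
def venn3_segments(a, b, c):
    """Return the sizes of the seven exclusive segments of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "a_b_c": len(a & b & c),
    }


# Toy CID sets (hypothetical values, not real PubChem identifiers of interest)
cner = {1, 2, 3, 4, 5, 6}   # stands in for the CNER patent sources
chembl = {4, 5, 7}          # stands in for ChEMBL
trp = {5, 6, 8, 9}          # stands in for Thomson Pharma

segments = venn3_segments(cner, chembl, trp)
for name, count in segments.items():
    print(f"{name}: {count}")
```

The segments are mutually exclusive, so their sizes sum to the size of the union of the three sets, which is a useful sanity check when the inputs are large CID lists.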
9
CNER caveats (I) fragmentation: Mw plots
Can be partially ameliorated by using Mw ranking as a filter
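As a rough illustration of using Mw as a filter, the sketch below drops records outside a molecular weight window and ranks the rest. The record layout, the identifiers, the weights and the 200 to 800 window are all assumptions made for the example; the slide does not specify cut-offs.

```python
def rank_and_filter_by_mw(records, min_mw=200.0, max_mw=800.0):
    """Keep records whose molecular weight lies in [min_mw, max_mw], ranked by Mw.

    The default window is an illustrative assumption, not a recommended cut-off.
    """
    kept = [r for r in records if min_mw <= r["mw"] <= max_mw]
    return sorted(kept, key=lambda r: r["mw"])


# Hypothetical extraction records: small fragments and oversized artefacts
# sit at the extremes of the Mw ranking.
records = [
    {"id": "X1", "mw": 45.0},     # likely a fragment
    {"id": "X2", "mw": 412.5},
    {"id": "X3", "mw": 1530.2},   # likely an extraction artefact
    {"id": "X4", "mw": 286.3},
]

filtered = rank_and_filter_by_mw(records)
ids = [r["id"] for r in filtered]
print(ids)
```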
10
CNER caveats (II) the bioactivity-gap: majority of patent chemistry has no linked assay data
11
CNER caveats (III): strange patent-unique structures
• Weird stuff generally non-biological chemistry (i.e. not A61)
• For the record C07D = 10.9, A61K = 0.9, (C07D + A61K) = 0.81 mil CIDs
12
CNER caveats (IV): mixture extractions (a mixed blessing)
• Mostly TFA or HCl salts
• Includes combination claims and reactant mixtures
• Causes sources to appear more divergent by exact match statistics
• PubChem splits to component CIDs while maintaining the back-mapping
• Can normalise with “CovalentUnitCount =1” filter
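Outside PubChem, the same idea as the covalent unit filter can be approximated on SMILES strings, where a dot separates disconnected components. This is a minimal sketch under that assumption: the function names are hypothetical, the example amine is arbitrary, and the simple dot count ignores the rare case where ring-closure digits bond components across a dot.

```python
def covalent_unit_count(smiles):
    """Approximate the number of covalently bonded units as dot-separated SMILES parts.

    Assumption: no ring-closure bonds span a dot (e.g. "C1.C1"), which would
    make this count too high.
    """
    return len([part for part in smiles.split(".") if part])


def single_unit_only(smiles_list):
    """Keep only structures made of a single covalent unit (drops salts and mixtures)."""
    return [s for s in smiles_list if covalent_unit_count(s) == 1]


extracted = [
    "CCN",                     # free base
    "CCN.Cl",                  # HCl salt of the same amine
    "CCN.OC(=O)C(F)(F)F",      # TFA salt of the same amine
]

kept = single_unit_only(extracted)
print(kept)
```

Here the salt forms are dropped rather than split; splitting on the dot and keeping the largest component would be the alternative when the parent structure needs to be recovered.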
13
An example
“Trifluoromethyl-oxadiazole derivatives and their use in the treatment of disease” (Novartis)
PCT for the patent family WO2013008162, 2013-01-17
14
SAR table
All three data sets extracted and example-numbered in BindingDB
15
PubChem retrieval by patent number -> series cluster
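For programmatic retrieval, a request along the following lines may work. This is a hedged sketch: it assumes the PubChem PUG REST `xref/PatentID` route, and the exact identifier format PubChem expects (for example, with or without a kind code) is an assumption that should be checked against the PUG REST documentation. The helper name `cids_by_patent_url` is hypothetical, and the code only builds the URL without calling the service.

```python
from urllib.parse import quote

# Base address of PubChem PUG REST
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cids_by_patent_url(patent_id):
    """Build a PUG REST URL asking for the CIDs cross-referenced to a patent ID.

    Assumption: the compound/xref/PatentID route accepts the identifier as given.
    """
    return f"{PUG_REST}/compound/xref/PatentID/{quote(patent_id, safe='')}/cids/JSON"


url = cids_by_patent_url("US9056843")
print(url)
```

Fetching the URL (for instance with `urllib.request.urlopen`) and parsing the JSON response is left out so the sketch runs offline.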
16
Extraction splits by source, date and isomeric connectivity (it can get complicated...)
Different sources (SIDs) for same structure (CID)
Different CID isomers with same core connectivity
17
Impressive SureChEMBL family extraction
4830 rows 648 IDs mapped to 511 PubChem CIDs
18
Extraction source selectivity
• 151 BindingDB CIDs direct from PubChem
• 93 Thomson Pharma CIDs (within the 151 above)
• 296 SDFs from SciFinder > 269 CIDs
• 648 SureChEMBL IDs > 511 CIDs
• Numbers are not absolute because of “round tripping” mapping issues but they illustrate the selectivity and extent of open coverage
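The counts above can be turned into rough percentages. The sketch below only does the arithmetic on the numbers given on the slide; the variable names and the reading of each ratio (mapping yield, share of the BindingDB set) are illustrative interpretations, and the caveat about round-tripping applies to all three.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# SureChEMBL IDs that mapped to PubChem CIDs (511 of 648)
surechembl_mapped = percent(511, 648)

# SciFinder SDF records that mapped to PubChem CIDs (269 of 296)
scifinder_mapped = percent(269, 296)

# Thomson Pharma CIDs as a share of the 151 BindingDB CIDs (93 of 151)
trp_share_of_bindingdb = percent(93, 151)

print(surechembl_mapped, scifinder_mapped, trp_share_of_bindingdb)
```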
19
Orthogonal entity mark-up (I): Ferret (Chrome plug-in)
20
Orthogonal entity mark-up (II): SciBite’s Termite (within SureChEMBL)
21
Roll-your-own extraction (II): OSRA
22
Roll-your-own extraction (I): ChemAxon chemicalize.org
23
Recent comparative analysis
• Compared SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial)
• Concluded: “50–66 % of the relevant content from the latter was also found in the former”
• Equivalent comparisons executed in the latest PubChem with all patent sources would probably record a higher overlap
Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents, Senger, et al. J. Cheminf. 2015, 7:49 doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)
http://www.ncbi.nlm.nih.gov/pubmed/26457120
24
First 64K$ Q: can you search your novel chemistry in open dbs?
• The InChIKey connectivity layer already facilitates blinded exact match (isomer-agnostic) searching anywhere, including Google
• PubChem and SureChEMBL default to https, so searching is secure
• There is no (and never will be?) patent case law where novelty was challenged in court based on structures intercepted from public servers
• Without metadata (e.g. target & disease) interception per se is not much use
• As for sequence data, hard evidence of serious competitive damage via query interception remains zero (after 20+ years)
• Commercial dbs cannot capture all prior art, so an open check is needed anyway
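The isomer-agnostic search mentioned in the first bullet relies on the first 14-character block of a standard InChIKey, which hashes the connectivity and so is shared between stereoisomers. A minimal sketch follows: the function names are hypothetical, and the two keys are the InChIKeys for L-alanine and for alanine with no stereo specified, used only as an illustration.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def connectivity_block(inchikey):
    """Return the first (connectivity) block of an InChIKey for blinded searching."""
    key = inchikey.strip().upper()
    if not INCHIKEY_RE.match(key):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return key.split("-")[0]


def same_connectivity(key_a, key_b):
    """True if two InChIKeys share the same connectivity block."""
    return connectivity_block(key_a) == connectivity_block(key_b)


l_alanine = "QNAYBMKLOCPYGJ-REOHXNAXSA-N"
alanine_no_stereo = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

query = connectivity_block(l_alanine)
print(query)
print(same_connectivity(l_alanine, alanine_no_stereo))
```

Pasting only the connectivity block into a search engine or database query exposes a hash rather than the drawn structure, which is the sense in which the search is blinded.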
25
Second 64K$ Q: Can you file based on open-only diligence?
If convinced your novel series < billion$ drug, maybe not - but consider:
• Chances of completely missing an overlapping chemical series in open sources from a competing patent are diminishing
• Prior art is confounded anyway by the 18-month publication shadow and Markush enumeration
• Filing a 12-month provisional is a low-cost option
• Portal queries allow you to find relevant patents (e.g. by target name) even if open chemistry extraction was limited
• The searches that really count are the ones the patent examiner does for you (on payment) using all their sources (including PubChem)
• However, attorney costs for drafting applications need balancing against savings on commercial patent resources
26
Conclusions
• The “Big Bang” of open chemistry and full text from patents now makes these an essential part of IP and bioactivity assessments for SMEs
• The combination of SureChEMBL and other sources within PubChem provides over 20 million patent-extracted structures and powerful analysis options
• The gap between open and commercial has narrowed to the point where you can at least consider doing without the latter
• Note also that the former has functionality absent from the latter
• Bioactivity identification, mining and target mapping are still challenging but becoming easier
• It is important to understand the quirks, artefacts and pitfalls of automated patent chemistry extraction so you can filter these
27
References and questions
http://cdsouthan.blogspot.com/ (19 posts have the tag “patents”)
http://www.ncbi.nlm.nih.gov/pubmed/26194581
http://www.ncbi.nlm.nih.gov/pubmed/23506624 (with PubMed Commons data link)
http://www.ncbi.nlm.nih.gov/pubmed/25415348
http://www.ncbi.nlm.nih.gov/pubmed/23399051
http://www.ncbi.nlm.nih.gov/pubmed/23618056