Whitney Symposium Lecture, June 2008
Crowd-Sourcing to Build a Structure Centric Community for Chemists
Antony Williams
Whitney Symposium 2008 - Networks
Building a Structure Centric Community for Chemists

Social Networking for Chemists

Network Drug Discovery Tools
www.curehunter.com

Beware the Networks!

Collaborative Authoring in Academia
Group level collaboration via Wikis

Collaborative Authoring for Drug Discovery
Pfizerpedia

Collaborative Knowledge Management for Chemists – Wikipedia, Built by a Network

and biologists… WikiProteins

WikiProteins
What Is Tegafur?

Commonly Lacking…
Approaches generally lack “structural intelligence”
Structures have properties (Mw, MF, exp. & pred. properties)
Collections of structures need to be searchable by structure
Most data collections are “self-contained” and rarely connect to other resources via “structure”

A Search Engine for Chemists
Questions a chemist might ask…
What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is Viagra?
What are the stereocenters of cholesterol?
Where can I find publications about Taxol?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
ChemSpider can answer all of these questions

ChemSpider Data Content
Over 20 million unique chemical structures:
Online Databases – PubChem, DrugBank, HMDB, Wikipedia
Chemical Vendors – over 40 different vendors and growing
Personal Depositions – individual contributions
Journal Publishers
Content database vendors
Analytical data collections
Patents (9 MILLION Structures to search patents)
Web scraping
Content is linked back to the original data sources

A Structure Centric Community for Chemists
A FREE ACCESS platform for deposition, management, curation, annotation and extension of information associated with chemical structures
Semantically connect to other sites providing access to knowledge, data and information of determined quality
Search by alphanumeric text, chemical structure and substructure and combination searches
Predict properties for submitted structures

Tell me about Aspirin

Links out to KEGG
Kyoto Encyclopedia of Genes and Genomes

Tell me about Aspirin

Abstract Compounds?
Is there any information about “Quesnoin”?
Type in the name (and there may be many) or other identifier
Paste a chemical structure
Draw the structure

Example Search

Example Search 2
What compounds have a mass of 300 +/- 0.001?
or search a combination of intrinsic/predicted properties
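
The mass-window question above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Compound` record, its field names, the `max_logp` filter and the toy entries are all hypothetical, and nothing here reflects how ChemSpider actually stores or queries its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    """Hypothetical record; field names are assumptions for illustration."""
    name: str
    mass: float   # monoisotopic mass in Da
    logp: float   # a predicted property (invented values below)


def search_by_mass(compounds, target, tolerance):
    """Return compounds whose mass lies within target +/- tolerance."""
    return [c for c in compounds if abs(c.mass - target) <= tolerance]


def search_combined(compounds, target, tolerance, max_logp):
    """Combine the intrinsic mass window with a predicted-property filter."""
    hits = search_by_mass(compounds, target, tolerance)
    return [c for c in hits if c.logp <= max_logp]


# Toy records, invented purely to exercise the functions.
RECORDS = [
    Compound("toy-A", 300.0004, 2.1),
    Compound("toy-B", 300.0020, 1.0),
    Compound("toy-C", 299.9995, 4.7),
    Compound("toy-D", 180.0423, 1.2),
]

mass_hits = search_by_mass(RECORDS, 300.0, 0.001)
combined_hits = search_combined(RECORDS, 300.0, 0.001, max_logp=3.0)
```

With the toy data, the mass window keeps toy-A and toy-C, and adding the predicted-property filter narrows that to toy-A.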

Example Search 2

Complex Search

Search Open Access Journals – ChemSpider

Search PubMed – ChemSpider

The Quality of Data Online…
Aggregating data opens up quality issues
Structure-identifier associations are “dirty”
Structures are COMMONLY incorrect – stereochem issues
Manual curation of small databases is enough work – what about millions of structures?
Structures are far from perfect. What is a “correct structure”?
  Full stereochemistry?
  Historical timeline of structure?
  Who is the authority?

Who holds THE Quality Authority?
Chemical Abstracts Service is the structural authority today. 1400 (?) employees, world standard in chemistry information
101 years of knowledge, process and expertise. MANUAL curation is key. Robotic curation is enabling
How can an online, free access system peacefully co-exist with the authority?

Quality is a Major Issue - Search Butanol

Crowd-sourcing Database Compilation

Wikipedia – Crowdsourcing Chemistry

Wikipedia Chemistry Curation project
Only ca. 5000 organic structures, 7000 total structures
MONTHS of work so far for a team of 6 people
Many errors removed in the process. Curation process is a daily event for users/depositors
Slow and torturous process for stereo molecules.

Thymol Blue on ChemSpider
Data online includes:
UV-vis spectrum
Measured experimental properties
Link to Wikipedia article
Links to chromatography details
Multiple identifiers/trade names etc.
Links to vendors/suppliers/other databases
Safety information

Differences between ChemSpider/Wikipedia
ChemSpider | Wikipedia
>20 million unique structures | ~5000 organics, 2000 others
Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, … | Text
Prediction of properties | No
Analytical Data | No
Active depositors/curators – 30 | Active editors – about 50 (?)
5000 people/day; 1100 registered | ????
Compound monographs linked | Detailed compound monographs

Differences between Wikipedia/ChemSpider
Wikipedia | ChemSpider
Supported by tried and tested MediaWiki platform | Primarily Microsoft .NET technologies with OS components
Established infrastructure and Wikimedia Foundation Team | “Out of a basement” on three servers and 5 volunteers
Chemistry is a subset of the ‘Pedia | Chemistry is the focus of ‘Spider
GFDL licensing for everything | Mixed “licensing”
Strong team of WP:Chem advocates, curators and admins | Growing team of WP:Chem advocates, curators and admins
Worldwide reputation as quality source | Growing reputation as focused on quality

Crowd-sourcing Curation
How to curate data for millions of structures?
Robot processes can clean up depositions
  Search for Chloride and check molecular formula for Cl
  Check for stereochemistry and remove names with stereo
Provide a simple-to-use platform to curate, annotate and tag data
Provide curator administration to prevent vandalism (Veropedia)
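
The two robot checks named above could look roughly like the following. It is a minimal sketch, not the ChemSpider robots: the `Record` layout and both regular expressions are assumptions, and the stereo rule is read here as "drop stereo-specific names when the stored structure has no defined stereochemistry", which is one plausible interpretation of the slide.

```python
import re
from dataclasses import dataclass

# Element symbol followed by an optional count, e.g. "Cl" + "4" in CCl4.
ELEMENT_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
# Simple stereo descriptors in names: (R), (2R,3S), (E), (Z), (+), (-).
STEREO_DESCRIPTOR = re.compile(r"\((?:\d*[RSEZ](?:,\d*[RSEZ])*|\+|-)\)")


@dataclass
class Record:
    """Hypothetical deposited record; field names are assumptions."""
    formula: str
    has_defined_stereo: bool
    names: list


def elements_in(formula):
    """Return the set of element symbols present in a molecular formula."""
    return {symbol for symbol, _count in ELEMENT_TOKEN.findall(formula)}


def flag_chloride_mismatch(record):
    """Names that mention 'chloride' although the formula contains no Cl."""
    if "Cl" in elements_in(record.formula):
        return []
    return [n for n in record.names if "chloride" in n.lower()]


def drop_stereo_names(record):
    """Keep only stereo-free names when the structure has no defined stereo."""
    if record.has_defined_stereo:
        return list(record.names)
    return [n for n in record.names if not STEREO_DESCRIPTOR.search(n)]


suspect = Record("C2H6O", False, ["ethanol", "ethyl chloride"])
flat = Record("C10H16", False, ["limonene", "(R)-limonene", "(+)-limonene"])
flagged = flag_chloride_mismatch(suspect)
kept = drop_stereo_names(flat)
```

Checks like these only flag candidates; the slides pair them with human curators, so a flagged name would go to review rather than be deleted outright.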

Multi-level Curation and Approval

Post Comments
Anyone can “Post Comments” associated with a structure. To curate data we require login to track

Crowd-sourcing Chemistry
Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
ALSO
Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)

But, when registered and logged in…
Ability to curate and add to the database
Add structures
“Clean” structures
Add data (spectra, CIFs, images)
Add links to other pages (URLs)
Add publication details

Adding to the Database - Structure

Adding New Text Data
Add Publication
Add Identifier
Add URL

Adding Supplementary Info to a Structure

Can ChemSpider Enable Discovery?
Yes, chemists can search by text, structure, substructure or properties to look at relationships and probe drug discovery

ChemSpider – Research in Progress
Supporting Open Notebook Science as a repository – JC Bradley at Drexel University
For the purpose of online virtual screening
Applying descriptors of various types to filter a database of 20 million compounds
In progress:
  Utilizing SimBioSys’ LASSO Descriptor
  Collaboration based on NISS’ ChemModLab

LASSO
Ligand Activity by Surface Similarity Order

LASSO Descriptors on ChemSpider
SEMANTIC WEB in action

LASSO Searching Method 1
Ask the question “What are the top 1000 molecules with similar LASSO descriptors to the actives for the Estrogen Receptor”

It WORKS - Enrichment Plot
60% of the actives were recovered in the top 1% of the database.
“Environmental binders” are weak binders
The top ranked compounds may well be active ER binders
Likely candidates for experimental investigation
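
For readers unfamiliar with enrichment plots, the "actives recovered in the top 1%" number can be computed as sketched below. The ranking and the active set are invented toy data, chosen only so that the result mirrors the 60% quoted on the slide; the function names are assumptions and this is not the LASSO scoring itself.

```python
import math


def top_cutoff(n_total, top_percent):
    """Number of ranked entries that fall inside the top `top_percent`."""
    return max(1, math.ceil(n_total * top_percent / 100))


def recovery(ranked_ids, active_ids, top_percent):
    """Fraction of all actives that appear in the top slice of the ranking."""
    actives = set(active_ids)
    if not actives:
        raise ValueError("need at least one active compound")
    cutoff = top_cutoff(len(ranked_ids), top_percent)
    found = sum(1 for cid in ranked_ids[:cutoff] if cid in actives)
    return found / len(actives)


def enrichment_factor(ranked_ids, active_ids, top_percent):
    """Hit rate in the top slice divided by the hit rate of the whole list."""
    actives = set(active_ids)
    if not actives:
        raise ValueError("need at least one active compound")
    cutoff = top_cutoff(len(ranked_ids), top_percent)
    found = sum(1 for cid in ranked_ids[:cutoff] if cid in actives)
    return (found * len(ranked_ids)) / (cutoff * len(actives))


# Toy ranking: 1000 compound ids, best-scored first; 5 invented actives.
RANKED = list(range(1000))
ACTIVES = {0, 4, 9, 500, 900}

recovered = recovery(RANKED, ACTIVES, top_percent=1)
enrichment = enrichment_factor(RANKED, ACTIVES, top_percent=1)
```

In this toy case three of the five actives sit in the first ten entries, so recovery is 0.6 and the enrichment factor is 60 relative to random selection.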

Tipping Point
Tipping point - the point at which a slow gradual change becomes irreversible and then proceeds with gathering pace

ChemSpider Forums/Blogs
Forum.chemspider.com
www.chemspider.com/blog

ChemSpider TouchGraph

What would we most like to do?
Enable “Collaborative Science”. What would that look like?
Access to chemical supplies when people need them
Awareness of available literature, patents, databases of curated content – whether Open Access or not. Transaction fees (or not) are between user and provider
Host Open Notebook Science exchanges

“ChemSpider Inside”
Instrument vendors integrated ChemSpider to their metabolism ID project – ChemSpider linked to all Mass Spec Instruments doing Metabolite ID?
Wikipedia roundtrip linking to ChemSpider
Google indexing ChemSpider at “fixed rate”
Integration to desktop drawing packages
Members of Microsoft BioIT Alliance
Discussions on Taverna’s Workflow Sourceforge group
Hosting Open Access articles shortly…

Where to from here? Short term
Integrated text and structure/substructure searching of the Open Access literature is in development
Web-based scraping of structure-based information – examples in place
Enhanced web services layer to integrate searches
Deposit updated Patent Database (9 million structures)
Reaction handling and deposition

Where to from here? Mid-term
Spidering for Chemistry – extract data from articles, webpages and data sources AND stay within copyright
WiChempedia project – wiki-layers on top of ChemSpider, alongside Wikipedia curation project.
Deeper integration to text-based searching and conversion of chemical names to structures for online structure searching:
  Improved integration with NCBI Entrez system
  Deliver “dedicated websites” for specific publishers

Where to from here? Mid-term
An extensible datamodel “on the fly” allows us to easily expand to integrate abstract data to structures
Data mine and curate “parameters” – physicochemical and physiological parameters to enable QSAR analysis, data modeling and provision of models online (UNC-Chapel Hill, NISS)

Our Challenges
There are “no employees”
ChemSpider is non-funded
System is hyper-dependent on ISP, power and limited compute power
We are upsetting a lot of people – evangelists, cheminformatics system vendors, publishers, data content providers

Acknowledgments
The ChemSpider team of volunteer developers
ChemSpider Advisory Group
Our curators, depositors and users
Suppliers of commercial software – Microsoft, ACD/Labs, OpenEye, ChemAxon, SimBioSys
SureChem – Structure Based Online Patent Searching

Further reading
www.chemspider.com/blog
Internet-based tools for communication and collaboration in chemistry, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 502-506, doi:10.1016/j.drudis.2008.03.015
A perspective of publicly accessible/open-access chemistry databases, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 495-501, doi:10.1016/j.drudis.2008.03.017