machine learning in the new world of scholarly communication philip e. bourne university of...
TRANSCRIPT
![Page 1: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/1.jpg)
Machine Learning in the New World of Scholarly
Communication
Philip E. BourneUniversity of California San Diego
[email protected]://www.sdsc.edu/pb/Talks/
1ICMLA 2008
![Page 2: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/2.jpg)
Disclaimer
I am not an expert in machine learning, but have applied SVMs to biological systems on occasion
2Protein Motions: Gu et al. PLoS Comp. Biol., 2006 2(7) e90P-P Interfaces: Chung et al. Proteins 2006 62(3) 630-640
![Page 3: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/3.jpg)
So Why am I Here?
There are events happening in what I broadly refer to as “scholarly communication” which I
believe offer new opportunities for those interested in machine learning
What are those opportunities and how can they be exploited?
3
![Page 4: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/4.jpg)
What If…
• What if … negative data was as easily obtainable as positive data
• What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge on a scale not previously possible
• What if … that knowledge included rich media
• What if … the value of that knowledge could be weighted according to the authority of the source
4
![Page 5: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/5.jpg)
Some big “Ifs”
But.. Would offer a much richer medium to learn from ..
Take home message – parts of this medium are already here and those
of us generating that medium are keen to collaborate
![Page 6: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/6.jpg)
Some big “Ifs”
Lets take a step back and see where we are today
![Page 7: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/7.jpg)
Today’s Research Cycle
Research[Grants]
JournalArticle
ConferencePaper
PosterSession
Feds
Societies
Publishers
Reviews
BlogsCommunity Service/Data
![Page 8: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/8.jpg)
Tomorrows Research Cycle
• The relationship between scientist and publisher is quite different
• The publisher is a warehouse for the workflow of scientific endeavor not just a repository for the end product
8
![Page 9: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/9.jpg)
Tomorrows Research Cycle: Evidence
• Publishers hubs:– Elsevier portals– PLoS collections
• Open Access/open review e.g. Biology Direct
• NIH Roadmap requires data be accessible• New Resources:
– www.researchgate.net– MetaLab (Borya Shakhnovich)
![Page 10: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/10.jpg)
What If…
• What if … negative data was as easily obtainable as positive data
• What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
• What if … that knowledge included rich media
• What if … the value of that knowledge could be weighted according to the authority of the source
10
![Page 11: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/11.jpg)
Example: The Protein Structure Initiative The X-ray Crystallography Pipeline
What if … negative data was as easily obtainable as positive data
Basic Steps
Target Selection
Crystallomics• Isolation,• Expression,• Purification,• Crystallization
DataCollection
StructureSolution
StructureRefinement
Functional Annotation Publish
Remains more of an Art than a Science
http://kb.psi-structuralgenomics.org/
![Page 12: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/12.jpg)
Positive and Negative Data are Required by the NIH to be deposited immediately
• Data are described by an ontology
• Perhaps some underlying principles can be learnt, particularly as the amount of data is increasing rapidly
http://pepcdb.pdb.org/PepcDB/documentation/pepcDB-v9.3.jpg
![Page 13: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/13.jpg)
What If…
• What if … negative data was as easily obtainable as positive data
• What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
• What if … that knowledge included rich media
• What if … the value of that knowledge could be weighted according to the authority of the source
13
![Page 14: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/14.jpg)
A First Step is to Have Open and Usable Access to the Scientific
Literature..
We are making steps in that direction
![Page 15: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/15.jpg)
15
NIH Public Access Policy“The research supported by the National Institutes of Health (NIH) is essential to improving human health. Public access to this research is vital – today and for generations to come.”From a letter from NIH Director Zerhouni to grantees, February 3rd, 2005
![Page 16: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/16.jpg)
16
More and more authors care about improving access to their papers…
“Faced with the option of submitting to an open-access or closed-access journal, we now wonder whether it is ethical for us to opt for closed access on the grounds of impact factor or preferred specialist audience.”
-- Costello and Osrin in The Lancet
![Page 17: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/17.jpg)
17
Where are we Today?
• NIH and other government funders have mandated open access
• Full text increasingly on-line and potentially usable
• Traditional publishers have used the internet as a distribution medium, but the power of the medium has yet to be realized
• Data increasingly on-line but not integrated with the publication derived from it
![Page 18: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/18.jpg)
The Growth of Open Access Literature
18
![Page 19: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/19.jpg)
Open Access(Creative Commons License)
1. All published materials available on-line free to all (author pays model)
2. Unrestricted access to all published material in various formats eg XML provided attribution is given to the original author(s)
3. Copyright remains with the author
19
![Page 20: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/20.jpg)
Open Access(Creative Commons License)
1. All published materials available on-line free to all (author pays model)
2. Unrestricted access to all published material in various formats eg XML provided attribution is given to the original author(s)
3. Copyright remains with the author The catalyst
PLoS Comp Biol 2008 4(3) e100003720
![Page 21: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/21.jpg)
Community Reaction?
Most scientists have no idea that this implies that anyone can take their material and enhance it e.g., via
mashup and effectively republish it
21
What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
![Page 22: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/22.jpg)
Consider an Example
What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
Fink & Bourne 2007 CT Watch 3(3) 26-31
![Page 23: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/23.jpg)
Database and Journal Integration- The Test Bed
http://www.pdb.org/
Journals
Database
23
http://www.pubmedcentral.nih.gov
![Page 24: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/24.jpg)
The Protein Data Bank
• Paper not published unless data are deposited – strong data to literature correspondence
• Highly structured data conforming to an extensive ontology
• DOI’s assigned to every structure – http://www.doi.org
http://www.pdb.org24
![Page 25: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/25.jpg)
Seamless Integration between Data and the Literature – What
Does That Imply?
• Improving semantic consistency in the literature – best done at the point of authoring
• Post processing to establish semantic content
• New forms of visualization and interaction at the presentation layer
25
![Page 26: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/26.jpg)
Seamless Integration between Data and the Literature – What
Does That Imply?
• Improving semantic consistency in the literature – best done at the point of authoring
• Post processing to establish semantic content
• New forms of visualization and interaction at the presentation layer
26
![Page 27: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/27.jpg)
1. A link brings up figures from the paper
0. Full text of PLoS papers stored in a database
2. Clicking the paper figure retrievesdata from the PDB which is
analyzed
3. A composite view ofjournal and database
content results
BioLit: Tools for New Modes of Scientific Dissemination
• Biolit integrates biological literature and biological databases and includes:– A database of journal
text– Authoring tools to
facilitate database storage of journal text
– Tools to make static tables and figures interactive
4. The composite view haslinks to pertinent blocks
of literature text and back to the PDB
1.
2.
3.
4.
The Knowledge and Data Cycle
http://biolit.ucsd.edu27
![Page 28: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/28.jpg)
http://biolit.ucsd.eduNucleic Acids Research 2008 36(S2) W385-389
PSP Washington DC Feb. 2008 28
![Page 29: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/29.jpg)
29
![Page 30: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/30.jpg)
30
![Page 31: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/31.jpg)
31
![Page 32: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/32.jpg)
ICTP Trieste, December 10, 2007 32
![Page 33: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/33.jpg)
33
What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
Immunology Literature
Cardiac DiseaseLiterature
![Page 34: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/34.jpg)
Semantic Consistency is Best Done at the Point of Authoring
What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
![Page 35: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/35.jpg)
Author Paper
Word File in Docx formatPublisher
BioLit Plugin Project
35
![Page 36: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/36.jpg)
BioLit Plugin Project
• Leverages Office Open XML used in Microsoft Office 2007
• Custom schema attached to document and used to automatically XML tag ontology terms and database identifiers within a research paper
• Ontology tagging assists publication of scientific research by aiding efficient and accurate automated categorization and promotion of information dissemination
• Conversion of manuscript to NLM DTD for direct submission to publisher
Automated Ontology & ID Tagging within Microsoft Word Documents
36
![Page 37: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/37.jpg)
BioLit Plugin ProjectRather than Post-processing the Document the
Author Controls the Semantic Tagging
37
![Page 38: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/38.jpg)
Plugin Architecture
38
![Page 39: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/39.jpg)
Ontologies are Stored in a Local Database
39
![Page 40: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/40.jpg)
User Configurable Selection
• Fully user configuration ontology and database identifier selection
• All searches occur within the user’s desktop computer
• Desired ontologies are downloaded and installed automatically, and update periodically
• BioLit installer XML file provides the application with the information needed to download and install ontologies.
40
![Page 41: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/41.jpg)
What If…
• What if … negative data was as easily obtainable as positive data
• What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
• What if … that knowledge included rich media
• What if … the value of that knowledge could be weighted according to the authority of the source
41
![Page 42: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/42.jpg)
What Do We Mean by Rich Media?
Non traditional ways of conveying scientific data and knowledge ..
Video, podcasts, postercasts, blogs…
What if … that knowledge included rich media
![Page 43: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/43.jpg)
YouTube for Scientists www.scivee.tv
43What if … that knowledge included rich media
![Page 44: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/44.jpg)
Motivation
44What if … that knowledge included rich media
![Page 46: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/46.jpg)
Pubcast
With voice to text conversion, presentation materials etc. new knowledge is available to supplement already existing knowledge from
the paper
46What if … that knowledge included rich media
![Page 47: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/47.jpg)
Postercasts
What if … that knowledge included rich media
Again additional knowledge can be used which until now has not been captured
![Page 48: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/48.jpg)
What If…
• What if … negative data was as easily obtainable as positive data
• What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
• What if … that knowledge included rich media
• What if … the value of that knowledge could be weighted according to the authority of the source
48
![Page 49: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/49.jpg)
First You Have to Identify the Source
What if … the value of that knowledge could be weighted according to the authority of the source
http://openid.nethttp://www.researcherid.com
![Page 50: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/50.jpg)
How Do we Weight the Various Knowledge Sources?
• Peer reviewed literature
• Reviews (papers, grants, proceedings)
• Blog postings• Database entries
What if … the value of that knowledge could be weighted according to the authority of the source
![Page 51: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/51.jpg)
How Do we Weight the Various Knowledge Sources?
• A token system• Tokens can be
authenticated by any user of that content
• Page ranking• ??
What if … the value of that knowledge could be weighted according to the authority of the source
PLoS Comp Biol Editorial this month
![Page 52: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/52.jpg)
In Conclusion
• Scholarly communication is in a state of rapid change
• Content easily available for machine learning is expanding and includes new content types
• New opportunities are here already
![Page 53: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649ddf5503460f94ad8c9e/html5/thumbnails/53.jpg)
Acknowledgements
• SciVee Team– Apryl Bailey– Tim Beck
– Leo Chalupa
– Marc Friedman– Alex Ramos– Willy Suwanto
• BioLit Team• J. Lynn Fink• Sergey Kushch• Marco Martinez• Greg Quinn• Parker Williams
CT Watch 2007, 3(3) 26-31
53