semantics enabled industrial and scientific applications: research, technology and deployed...
TRANSCRIPT
![Page 1: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/1.jpg)
Semantics Enabled Industrial and Scientific Applications: Research, Technology and
Deployed Applications Part III: Biological Applications
Keynote - the First Online Metadata and Semantics Research Conference
http://www.metadata-semantics.org November 23, 2005Amit Sheth
LSDIS Lab, Department of Computer Science,University of Georgia
http://lsdis.cs.uga.edu
Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Christopher Thomas, Cartic Ramakrishan.
![Page 2: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/2.jpg)
Computation, data and semantics in life sciences• “The development of a predictive biology will likely be one
of the major creative enterprises of the 21st century.” Roger Brent, 1999
• “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000
• "Biological research is going to move from being hypothesis-driven to being data-driven." Robert Robbins
• We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb
We will show how semantics is a key enabler for achieving the above predictions and visions.
![Page 3: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/3.jpg)
Bioinformatics Apps & Ontologies• GlycOGlycO: A domain ontology for glycan structures, glycan functions
and enzymes (embodying knowledge of the structure and metabolisms of glycans) Contains 600+ classes and 100+ properties – describe structural
features of glycans; unique population strategy URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
• ProPreOProPreO: a comprehensive process Ontology modeling experimental proteomics Contains 330 classes, 40,000+ instances Models three phases of experimental proteomics* –
Separation techniques, Mass Spectrometry and, Data analysis; URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo
• Automatic semantic annotation of high throughput experimental data Automatic semantic annotation of high throughput experimental data (in progress)
• Semantic Web Process with WSDL-S for semantic annotations of Web Semantic Web Process with WSDL-S for semantic annotations of Web ServicesServices
– http://lsdis.cs.uga.edu -> Glycomics project (funded by NCRR)
![Page 4: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/4.jpg)
GlycO – A domain ontology for glycans
![Page 5: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/5.jpg)
GlycO
![Page 6: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/6.jpg)
Structural modeling and populationchallenges in GlycO• Extremely large number of glycans occurring in
nature• But, frequently there are small differences
structural properties• Modeling all possible glycans would involve
significant amount of redundant classes• Redundancy results in often fatal complexities in
maintenance and upgrade• Population
– Manual– Extraction and integration from external knowledge
sources– GlycoTree – exploiting structural composition rules
![Page 7: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/7.jpg)
Ontology population workflow
GlycoTreeTakahashi, Kato 2003
![Page 8: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/8.jpg)
GlycoTree – A Canonical Representation of N-Glycans
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251
-D-GlcpNAc-D-GlcpNAc-D-Manp-(1-4)- -(1-4)-
-D-Manp -(1-6)+-D-GlcpNAc-(1-2)-
-D-Manp -(1-3)+-D-GlcpNAc-(1-4)-
-D-GlcpNAc-(1-2)+
-D-GlcpNAc-(1-6)+
![Page 9: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/9.jpg)
Beyond expressiveness afforded in OWL• Probabilistic• more
![Page 10: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/10.jpg)
Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.
Example: Mass spectrometry analysis
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
![Page 11: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/11.jpg)
Mass Spectrometry ExperimentEach m/z value in mass spec diagrams can
stand for many different structures (uncertainty wrt to structure that corresponds to a peak)
• Different linkage• Different bond• Different isobaric structures
![Page 12: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/12.jpg)
Very subtle differences
• Peak at 1219.1 • Same molecular
composition• One diverging link• Found in different
organisms• background knowledge
(found in honeybee venom or bovine cells) can resolve the uncertainty
These are core-fucosylated high-mannose glycans
CBank: 16155Honeybee venom
CBank: 16154Bovine
![Page 13: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/13.jpg)
Even in the same organism
• Both Glycans found in bovine cells
• Both have a mass of 3425.11
• Same composition• Different linkage• Since expression levels
of different genes can be measured in the cell, we can get probability of each structure in the sample
Different enzymes lead to these linkages
CBank: 21821
CBank: 21982
![Page 14: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/14.jpg)
Model 1: associate probability as part of Semantic Annotation
• Annotate the mass spec diagram with all possibilities and assign probabilities according to the scientist’s or tool’s best knowledge
![Page 15: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/15.jpg)
P(S | M = 3461.57) = 0.6 P(T | M = 3461.57)
= 0.4
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
![Page 16: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/16.jpg)
Model 2: Probability in ontological representation of Glycan structure• Build a generalized probabilistic glycan
structure that embodies several possible glycans
![Page 17: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/17.jpg)
N-GlycosylationN-Glycosylation ProcessProcess (NGPNGP)Cell Culture
Glycoprotein Fraction
Glycopeptides Fraction
extract
Separation technique I
Glycopeptides Fraction
n*m
n
Signal integrationData correlation
Peptide Fraction
Peptide Fraction
ms data ms/ms data
ms peaklist ms/ms peaklist
Peptide listN-dimensional arrayGlycopeptide identificationand quantification
proteolysis
Separation technique II
PNGase
Mass spectrometry
Data reductionData reduction
Peptide identificationbinning
n
1
![Page 18: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/18.jpg)
![Page 19: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/19.jpg)
Phase II: Ontology PopulationPhase II: Ontology PopulationPopulate ProPreO with all experimental
datasets?Two levels of ontology population for
ProPreO: Level 1: Populate the ontology with instances
that a stable across experimental runsEx: Human Tryptic peptides – 40,000 instances in
ProPreO Level 2: Use of URIs to point to actual
experimental datasets
![Page 20: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/20.jpg)
Ontology-mediated Proteomics Ontology-mediated Proteomics ProtocolProtocol
RAW Files
MassSpectrometer
ConversionTo
PKL
Preprocessing DB Search Post processing
Data Processing Application
Instrument
DBStoring Output
PKL Files (XML-based Format)‘Clean’ PKL FilesRAW Results File
Output (*.dat)
Micromass_Q_TOF_ultima_quadrupole_time_of_flight_mass_spectrometer
Masslynx_Micromass_application
mass_spec_raw_data
Micromass_Q_TOF_micro_quadrupole_time_of_flight_ms_raw_dataPeoPreO
produces_ms-ms_peak_list
All values of the produces ms-ms peaklist property are micromass pkl ms-ms peaklist
RAW Files
‘Clean’ PKL Files
![Page 21: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/21.jpg)
Semantic Annotation of Scientific DataSemantic Annotation of Scientific Data
830.9570 194.9604 2580.2985 0.3592688.3214 0.2526
779.4759 38.4939784.3607 21.77361543.7476 1.38221544.7595 2.9977
1562.8113 37.47901660.7776 476.5043
ms/ms peaklist data
<ms/ms_peak_list>
<parameter instrument=micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer
mode = “ms/ms”/>
<parent_ion_mass>830.9570</parent_ion_mass>
<total_abundance>194.9604</total_abundance>
<z>2</z>
<mass_spec_peak m/z = 580.2985 abundance = 0.3592/>
<mass_spec_peak m/z = 688.3214 abundance = 0.2526/>
<mass_spec_peak m/z = 779.4759 abundance = 38.4939/>
<mass_spec_peak m/z = 784.3607 abundance = 21.7736/>
<mass_spec_peak m/z = 1543.7476 abundance = 1.3822/>
<mass_spec_peak m/z = 1544.7595 abundance = 2.9977/>
<mass_spec_peak m/z = 1562.8113 abundance = 37.4790/>
<mass_spec_peak m/z = 1660.7776 abundance = 476.5043/>
<ms/ms_peak_list>
Annotated ms/ms peaklist data
![Page 22: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/22.jpg)
Semantic annotation of Scientific Semantic annotation of Scientific DataData
Annotated ms/ms peaklist data
<ms/ms_peak_list>
<parameter
instrument=“micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer”
mode = “ms/ms”/>
<parent_ion_mass>830.9570</parent_ion_mass>
<total_abundance>194.9604</total_abundance>
<z>2</z>
<mass_spec_peak m/z = 580.2985 abundance = 0.3592/>
<mass_spec_peak m/z = 688.3214 abundance = 0.2526/>
<mass_spec_peak m/z = 779.4759 abundance = 38.4939/>
<mass_spec_peak m/z = 784.3607 abundance = 21.7736/>
<mass_spec_peak m/z = 1543.7476 abundance = 1.3822/>
<mass_spec_peak m/z = 1544.7595 abundance = 2.9977/>
<mass_spec_peak m/z = 1562.8113 abundance = 37.4790/>
<mass_spec_peak m/z = 1660.7776 abundance = 476.5043/>
<ms/ms_peak_list>
![Page 23: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/23.jpg)
Formalize description and classification of Web Services using ProPreO concepts
Service description using WSDL-SService description using WSDL-S
<?xml version="1.0" encoding="UTF-8"?><wsdl:definitions targetNamespace="urn:ngp" …..xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<wsdl:types> <schema targetNamespace="urn:ngp“ xmlns="http://www.w3.org/2001/XMLSchema"> …..</complexType> </schema> </wsdl:types> <wsdl:message name="replaceCharacterRequest"> <wsdl:part name="in0" type="soapenc:string"/> <wsdl:part name="in1" type="soapenc:string"/> <wsdl:part name="in2" type="soapenc:string"/> </wsdl:message> <wsdl:message name="replaceCharacterResponse"> <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/> </wsdl:message>
WSDL ModifyDBWSDL-S ModifyDB
<?xml version="1.0" encoding="UTF-8"?><wsdl:definitions targetNamespace="urn:ngp" ……xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics" xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl" >
<wsdl:types> <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">……</complexType> </schema> </wsdl:types> <wsdl:message name="replaceCharacterRequest" wssem:modelReference="ProPreO#peptide_sequence"> <wsdl:part name="in0" type="soapenc:string"/> <wsdl:part name="in1" type="soapenc:string"/> <wsdl:part name="in2" type="soapenc:string"/> </wsdl:message> ProPreO
process Ontology
data
sequence
peptide_sequence
Concepts defined in
process Ontology
Description of a Web Service using:WebServiceDescriptionLanguage
![Page 24: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/24.jpg)
Summary, Observations, Conclusions• Ontology Schema: relatively simple in
business/industry, highly complex in science• Ontology Population: could have millions of assertions,
or unique features when modeling complex life science domains
• Ontology population could be largely automated if access to high quality/curated data/knowledge is available; ontology population involves disambiguation and results in richer representation than extracted sources, rules based population
• Ontology freshness (and validation—not just schema correctness but knowledge—how it reflects the changing world)
![Page 25: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/25.jpg)
Summary, Observations, Conclusions• Some applications: semantic search,
semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …
![Page 26: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the](https://reader036.vdocuments.us/reader036/viewer/2022062308/56649e5c5503460f94b5439f/html5/thumbnails/26.jpg)
More information at
• http://lsdis.cs.uga.edu/projects/glycomics