Algorithms, data structures and web computing
for data mining in biomedicine
Jonas S Almeida
Dept Bioinformatics and Comp. Biol.
Univ Texas MDAnderson Cancer Center
OECD Workshop on Knowledge Markets in the Life Sciences
16-17 October 2008
Reference Papers on integrative infrastructure
Silva S, Gouveia-Oliveira R, Maretzek A, Carrico J, Gudnason T, Kristinsson KG, Ekdahl K, Brito-Avo A, Tomasz A, Sanches IS, de Lencastre H, Almeida JS (2003) EURISWEB - Web-based epidemiological surveillance of antibiotic-resistant pneumococci in Day Care Centers. BMC Med Inform Decis Mak. 2003 Jul 8;3(1):9. [PMID:12846930]
Wang X, R Gorlitsky, and JS Almeida (2005) From XML to RDF: How Semantic Web Technologies Will Change the Design of 'Omic' Standards. Nature Biotechnology, Sep;23(9):1099-103 [PMID:16151403].
Almeida JS, C Chen, R Gorlitsky, R Stanislaus, M Aires-de-Sousa, P Eleutério, JA Carriço, A Maretzek, A Bohn, A Chang, F Zhang, R Mitra, GB Mills, X Wang, HF Deus (2006) Data integration gets 'Sloppy'. Nature Biotechnology 24(9):1070-1071. [PMID:16964209].
Deus HF, R Stanislaus, DF Veiga, C Behrens, II Wistuba, JD Minna, HR Garner, SG Swisher, JA Roth, AM Correa, B Broom, K Coombes, A Chang, LH Vogel, JS Almeida (2008) A Semantic Web management model for integrative biomedical informatics. PLoS ONE. Aug 13;3(8):e2946 [PMID:18698353].
This outcome was anticipated right at the onset of the Web [recall Tim Berners-Lee's “Weaving the Web”].
Desired key features of a web-based data management system:
1. Syntactic interoperability: ability to get the data once told where it is.
2. Semantic interoperability: ability to use the data for a different purpose than the one that dictated its generation.
The path backwards.
Model ID / Variable Selection / Discovery
Model ID: models, transfer functions [ y = f(x) ]
Variable Selection: boosting, evolutionary algorithms, exhaustive search [ x X ]
Discovery: self-described structures, ontologies, RDF, Description Logic, S3DB [ x [X,Z] ]
Models ----------------------- Tools ---------------------------------- Software Environment
#14. Almeida, J.S., M.A.M.Reis, M.J.T.Carrondo (1997) A Novel Unifying Kinetic Model of Denitrification. J. Theor. Biol. 186:241-249. [doi:10.1006/jtbi.1996.0352]
#31. Wolf G. Almeida JS. Pinheiro C. Correia V. Rodrigues C. Reis MAM. Crespo JG. (2001) Two-dimensional fluorometry coupled with artificial neural networks: A novel method for on-line monitoring of complex biological processes. Biotechnology & Bioengineering. 72(3):297-306.[PMID:11135199]
#36. Almeida, JS (2002) Predictive non-linear modeling of complex data by artificial neural networks. Curr. Op. Biotech. 13(1) 72-76.[PMID:11849962]
#68. Mikhitarian, K., Gillanders, W.E., Almeida, J.S., Hebert Martin R., Varela J.C., Metcalf, J.S., Cole, D.J., and Mitas, M. (2005) An innovative microarray strategy identities informative molecular markers for the detection of micrometastatic breast cancer. Clinical Cancer Research 11(10):3697-704. [PMID:15897566]
#72. Almeida JS, DJ McKillen, YA Chen, PS Gross, RW Chapman, G Warr (2005) Design and Calibration of Microarrays as Universal Transcriptomic Environmental Biosensors. Comparative and Functional Genomics, 6(3):132-137(6). [doi:10.1002/cfg.466].
#77. Garcia SP, Almeida JS (2005) Multivariate phase space reconstruction by nearest neighbor embedding with different time delays. Physical Review E 72, 027205. [PMID:16196759].
#78. Oates JC, Varghese S, Bland AM, Taylor TP, Self SE, Stanislaus R, Almeida JS, Arthur JM (2005) Prediction of urinary protein markers in lupus nephritis. Kidney Int. Dec;68(6):2588-92 [PMID:16316334].
#86. Geli P, P Rolghamre, JS Almeida, K Ekdahl (2006) Modeling Pneumococcal Resistance to Penicillin in Southern Sweden Using Artificial Neural Networks. Microbial Drug Resistance 12(3):149-157. [PMID:17002540]
#95. Wolf G, JS Almeida, JG Crespo, MA Reis (2007) An improved method for two-dimensional fluorescence monitoring of complex bioreactors. J Biotechnol. 128(4):801-12. [PMID:17291616].
#103. Sá-Leão R, Nunes S, Brito-Avô A, Alves CR, Carriço JA, Saldanha J, Almeida JS, Santos-Sanches I, de Lencastre H. (2008) High rates of transmission of and colonization by Streptococcus pneumoniae and Haemophilus influenzae within a day care center revealed in a longitudinal study. J Clin Microbiol. Jan;46(1):225-34. [PMID: 18003797]
Model ID
Lesson learned: predictive independent variables are a needle in the haystack.
Model ID / Variable Selection
#63. Almeida JS, R Stanislaus, E Krug, J Arthur (2005) Normalization and Analysis of residual variation in 2D Gel Electrophoresis for quantitative differential proteomics. Proteomics 5(5):1242-9 [PMID:15732138].
#64. Mitas M, JS Almeida, K Mikhitarian, WE Gillanders, DN Lewin, DD Spyropoulos, L Hoover, A Graham, T Glenn, P King, DJ Cole, R Hawes, CE Reed, BJ Hoffman (2005) Accurate discrimination of Barrett’s esophagus and esophageal adenocarcinoma using a quantitative three-tiered algorithm and multi-marker real-time RT-PCR. Clin Cancer Res. 2005 Mar 15;11(6):2205-14 [PMID:15788668].
#83. Mueller M, Wagner CL, Annibale DJ, Knapp RG, Hulsey TC, Almeida JS (2006) Parameter selection for and implementation of a web-based decision-support tool to predict extubation outcome in premature infants. BMC Medical Informatics and Decision Making 6:11 [PMID:16509967].
#87. Almeida JS, Oates JC, Arthur JM. (2006) The need for concurrent calibration and discrimination statistics in predictive models. Kidney Int. 70(1):231-2. [doi:10.1038/sj.ki.5001519].
#89. Carrico JA, Silva-Costa C, Melo-Cristino J, Pinto FR, de Lencastre H, Almeida JS, Ramirez M. (2006) Illustration of a common framework for relating multiple typing methods by application to macrolide-resistant Streptococcus pyogenes. J Clin Microbiol. 44(7):2524-32. [PMID:16825375].
#91. Almeida, J.S., S.Vinga (2006) Computing distribution of scale independent motifs in biological sequences. Algorithms for Molecular Biology. 1:18. [PMID:17049089].
#96. Pinto FR, Carrico JA, Ramirez M, Almeida JS. (2007) Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement. BMC Bioinformatics 8(1):44. [PMID:17286861].
#102. Vinga S, Almeida JS. (2007) Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics. 2007 Oct 16;8(1):393. [PMID: 17939871]
Lesson learned: critical co-variables are often found in other haystacks.
Model ID / Variable Selection / Discovery
#72. Almeida JS, DJ McKillen, YA Chen, PS Gross, RW Chapman, G Warr (2005) Design and Calibration of Microarrays as Universal Transcriptomic Environmental Biosensors. Comparative and Functional Genomics, 6(3):132-137(6). [doi:10.1002/cfg.466].
#76. Wang X, R Gorlitsky, and JS Almeida (2005) From XML to RDF: How Semantic Web Technologies Will Change the Design of ‘Omic’ Standards. Nature Biotechnology, Sep;23(9):1099-103 [PMID:16151403].
#84. Karpievitch YV, Almeida JS (2006) mGrid: A parallel Matlab library for user code distribution. BMC Bioinformatics 7:139 [PMID:16539707].
#90. Almeida JS, C Chen, R Gorlitsky, R Stanislaus, M Aires-de-Sousa, P Eleutério, JA Carriço, A Maretzek, A Bohn, A Chang, F Zhang, R Mitra, GB Mills, X Wang, HF Deus (2006) Data integration gets 'Sloppy'. Nature Biotechnology 24(9):1070-1071. [PMID:16964209].
#101. Vilela M, Borges CC, Vinga S, Vasconcelos AT, Santos H, Voit EO, Almeida JS. (2007) Automated smoother for the numerical decoupling of dynamics models. BMC Bioinformatics 8(1):305. [PMID: 17711581]
#104. Stanislaus R, JM Arthur, B Rajagopalan, R Moerschell, B McGlothlen, JS Almeida (2008). An open-source representation for 2-DE-centric proteomics and support infrastructure for data storage and analysis, BMC Bioinformatics. Jan 7;9:4. [PMID: 18179696]
Lesson learned: more than domain specific models or tools, integrative research requires a Knowledge Engineering environment.
The critical characteristic of that environment is semantic interoperability for both data and tools. Lack of syntactic interoperability is inexcusable.
[Figure: a list of Rules (rel0 to rel6) alongside a list of Statements, each labeled with one of those rules.]
RDF - everything is a resource
Wang X, R Gorlitsky, and JS Almeida (2005) From XML to RDF: How Semantic Web Technologies Will Change the Design of 'Omic' Standards. Nature Biotechnology, Sep;23(9):1099-103 [PMID:16151403].
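[Figure: two panels. The first, labeled "E ER", shows a Rules table with columns Subject, Relation, Object. The second, labeled "RDF", shows Rules, Statements and Resources tables, with column labels Subject Unique ID, Relation, Object Value and Resource Unique ID.]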
S3DB – user and project tables
Multiple project management
(Wang 2005)
www.s3db.org
Functional considerations
Operational considerations
S3DB – 3 table, single project
Almeida et al. (2006) Data integration gets 'Sloppy'. Nature Biotechnology 24(9):1070-1071.
S3DB:Project: Shultz
Rules:
<V2><Person><has><Name>
<V3><Dog><has><Name>
<V4><Person><has><Dog>
<V5><Person><has><Age>
Statements:
<S12><P1><R6><V2>"Charlie Brown"
<S13><P1><R6><V4><R7>
<S14><P1><R6><V5>"56 years old"
<S15><P1><R7><V3>"Snoopy"
Resources:
<R6> "This is Charlie Brown"
<R7> "This is Snoopy, Charlie's Dog"
N3:
<P#1><s3:project>"Shultz".
<RC#8><s3:resource><P#1>,<s3:name>"Person".
<RC#9><s3:resource><P#1>,<s3:name>"Dog".
<V#2><s3:rule><P#1><s3:subject><RC#8>,<s3:verb>"has",<s3:object>"Name".
<V#3><s3:rule><P#1><s3:subject><RC#9>,<s3:verb>"has",<s3:object>"Name".
<V#4><s3:rule><P#1><s3:subject><RC#8>,<s3:verb>"has",<s3:object><RC#9>.
<V#5><s3:rule><P#1><s3:subject><RC#8>,<s3:verb>"has",<s3:object>"Age".
<R#6><s3:rsrcInstance><RC#8>,<s3:notes>"This is Charlie Brown".
<R#7><s3:rsrcInstance><RC#9>,<s3:notes>"This is Snoopy, Charlie's Dog".
<S#12><V#2>[<R#6>,"Charlie Brown"].
<S#13><V#4>[<R#6>,<R#7>].
<S#14><V#5>[<R#6>,"56 years old"].
<S#15><V#3>[<R#7>,"Snoopy"].
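The Rules / Statements / Resources split in the example project can be mirrored in a few lines of Python. This is only an illustrative sketch, not S3DB's implementation: the class names, field names and the `is_valid` check are assumptions introduced here, and the data are the Charlie Brown and Snoopy entries from the slide. The idea it tries to show is that a Statement carries no schema of its own; it is interpreted by looking up the Rule it points to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    subject: str  # resource class the rule applies to, e.g. "Person"
    verb: str
    obj: str      # a resource class ("Dog") or a literal attribute ("Name")


@dataclass(frozen=True)
class Statement:
    statement_id: str
    resource_id: str  # the resource the statement is about
    rule_id: str      # the rule that gives the statement its meaning
    value: str        # a literal, or the id of another resource


# Resource id -> (resource class, free-text note), as in the slide example.
RESOURCES = {
    "R6": ("Person", "This is Charlie Brown"),
    "R7": ("Dog", "This is Snoopy, Charlie's Dog"),
}
RESOURCE_CLASSES = {"Person", "Dog"}

RULES = {
    rule.rule_id: rule
    for rule in (
        Rule("V2", "Person", "has", "Name"),
        Rule("V3", "Dog", "has", "Name"),
        Rule("V4", "Person", "has", "Dog"),
        Rule("V5", "Person", "has", "Age"),
    )
}

STATEMENTS = [
    Statement("S12", "R6", "V2", "Charlie Brown"),
    Statement("S13", "R6", "V4", "R7"),
    Statement("S14", "R6", "V5", "56 years old"),
    Statement("S15", "R7", "V3", "Snoopy"),
]


def is_valid(statement: Statement) -> bool:
    """Check a statement against the rule it references (hypothetical check)."""
    rule = RULES.get(statement.rule_id)
    resource = RESOURCES.get(statement.resource_id)
    if rule is None or resource is None:
        return False
    # The resource must belong to the class named as the rule's subject.
    if resource[0] != rule.subject:
        return False
    # If the rule's object is a resource class, the value must be a
    # resource of that class; otherwise any literal is accepted.
    if rule.obj in RESOURCE_CLASSES:
        target = RESOURCES.get(statement.value)
        return target is not None and target[0] == rule.obj
    return True


def as_triple(statement: Statement) -> tuple:
    """Expand a statement into a readable (subject, predicate, object) triple."""
    rule = RULES[statement.rule_id]
    return (statement.resource_id, f"{rule.verb} {rule.obj}", statement.value)
```

Because the schema lives in `RULES` as data, adding a new kind of fact means adding a Rule rather than altering a table definition, which is the flexibility the three-table design is after.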
[Figure: the S3DB core model. Deployment, Project, Collection, Item, Rule, Statement, User and Group entities are linked by rdfs:subClassOf, rdf:subject, rdf:predicate and rdf:object relationships.]
Rule: [Cid] [Iid] [Cid or L]
Statement: [Iid] [Rid] [Iid or L]
Unique Identifiers of entities:
Durl rdf:type s3db:Deployment
Pid rdf:type s3db:Project
Cid rdf:type s3db:Collection
Rid rdf:type s3db:Rule
Sid rdf:type s3db:Statement
Iid rdf:type s3db:Item
Uid rdf:type s3db:User
Gid rdf:type s3db:Group
Annotation of s3db entities {Dublin Core:}
dc:created_by Uid
dc:created_on date
dc:service {term of cv}
etc …
Legend:
S3DB Entity (annotated using DC)
Relationship (defined using RDFS)
Permission (defined by s3db:permission)
Needed only if sharing with a Project that is hosted by a distinct S3DB Deployment.
Rule: Attribute
Statement: Value
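The identifier scheme in the core model (Pid, Cid, Rid, Sid, Iid, Uid, Gid, with Deployments identified by URL) and the triple shapes [Cid] [Iid] [Cid or L] and [Iid] [Rid] [Iid or L] can be sketched as simple type checks. The snippet below is a hypothetical illustration: the concrete identifier format (one letter followed by digits, such as `I6`), the function names, and the reading of the three brackets as subject, predicate and object are assumptions made here, not details taken from S3DB itself.

```python
import re
from typing import Tuple

# Class for each identifier prefix, following the slide's list of entities.
ENTITY_CLASSES = {
    "P": "s3db:Project",
    "C": "s3db:Collection",
    "R": "s3db:Rule",
    "S": "s3db:Statement",
    "I": "s3db:Item",
    "U": "s3db:User",
    "G": "s3db:Group",
}

# Assumed identifier format: one prefix letter followed by digits.
ID_PATTERN = re.compile(r"^([PCRSIUG])\d+$")


def entity_class(entity_id: str) -> str:
    """Return the s3db class of an identifier; URLs denote Deployments."""
    if entity_id.startswith(("http://", "https://")):
        return "s3db:Deployment"
    match = ID_PATTERN.match(entity_id)
    if match is None:
        raise ValueError(f"not a recognised identifier: {entity_id!r}")
    return ENTITY_CLASSES[match.group(1)]


def type_triple(entity_id: str) -> Tuple[str, str, str]:
    """Build the 'X rdf:type s3db:Class' triple for an identifier."""
    return (entity_id, "rdf:type", entity_class(entity_id))


def _require(entity_id: str, expected: str) -> None:
    if entity_class(entity_id) != expected:
        raise ValueError(f"{entity_id!r} is not a {expected}")


def rule_triple(collection_id: str, verb_item_id: str, obj: str) -> Tuple[str, str, str]:
    """Rule shape [Cid] [Iid] [Cid or L]; the object is left unchecked
    because it may be either a Collection id or a literal."""
    _require(collection_id, "s3db:Collection")
    _require(verb_item_id, "s3db:Item")
    return (collection_id, verb_item_id, obj)


def statement_triple(item_id: str, rule_id: str, obj: str) -> Tuple[str, str, str]:
    """Statement shape [Iid] [Rid] [Iid or L]; the object may be an Item id
    or a literal, so only subject and predicate are checked."""
    _require(item_id, "s3db:Item")
    _require(rule_id, "s3db:Rule")
    return (item_id, rule_id, obj)
```

For instance, `statement_triple("I6", "R2", "Charlie Brown")` is accepted, whereas passing a Collection id as the subject raises a `ValueError`.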
[Figure: S3DB architecture. Labels: S3DB; WebS3DB (generic Web-based GUI for S3DB); specialized applications (stand alone); Web server at IBL, ibl.mdanderson.org (I/O for machines); client machine (in the lab).]
Snapshots of interfaces using S3DB's API (Application Programming Interface). These applications exemplify why semantic web designs can be particularly effective at enabling generic tools to assist users in exploring data documenting very specific and very complex relationships. Snapshot A was taken from S3DB's web interface, which is included in the downloadable package. This interface was developed to assist in managing the database model and is therefore centered on the visualization and manipulation of the domain of discourse: its Collections of Items and the Rules defining the documentation of their relations. Snapshots B-D depict a document management tool, S3DBdoc, freely available as a Bioinformatics Station module (see Figure 6). Navigation starts from the Project (C), proceeds to the Collection (B) and ends with the editing of the Statements about an Item (D). Snapshot B illustrates an intermediate step in the navigation, where the list of Items (in this case samples assayed by tissue arrays, for which there is clinical information about the donor) is being trimmed according to the properties of a distant entity, Age at Diagnosis, which is a property of the Clinical Information Collection associated with the sample that originated the array results. This interaction would have been difficult and computationally intensive to manage using a relational architecture. The RDF formatted query result produced by the API was also visualized using a commercial tool, Sentient Knowledge Explorer (IO-Informatics Inc), shown in snapshot E, and by Welkin (snapshot F), developed by the digital interoperability SIMILE project at the Massachusetts Institute of Technology. See text for discussion of graphic representations by these tools. To protect patient confidentiality, some values in snapshots B and D are scrambled, and numeric sample and patient identifiers elsewhere are altered.
#    Entity
104  exfoliatins
103  enterotoxins
102  ClfB
101  LN2 viability test
100  institution
97   antibiotic consumption
96   MRSE frequency
95   MRSA frequency
81   Plasmid analysis
74   mechanism and genes
73   target
63   name
62   number of children
61   DCC
60   bed size
59   specialty
58   category
57   SCCmec typing
56   Rep-PCR
55   Dot-blot
54   LN2 freezing
53   patient clinical data
52   Hospital
51   final classification
50   species and tests
49   code
48   indoor area
47   outdoor area
46   number of employees
45   number of rooms
44   country, city
43   country, state/province/county, city
42   -80°C
41   isolate reference
40   susceptibility
39   ITQB isolate
38   MIC
37   alternative name
36   3-4 letter code
35   name
34   country, state/province/county, city
33   PCR genes amplification
32   Agr
31   susceptibility
30   beta-lactamase
29   isolates from same subject
28   MIC
27   setting, hospital/DCC/heard, service/room, ICU
26   project, period
25   collection date
24   disk inhibition
23   subject type
22   full name
21   class
20   abbreviation
19   Antibiotic
18   SmaI hybridization bands
17   Phagetyping
16   Ribotyping
15   other
14   hemolysins
13   leukocidins
12   project, station
11   disk inhibition
10   PFGE
9    ClaI-mecA::Tn554
8    MLST
7    patient (or subject) demographic data
6    patient admittance data
5    collection site
4    RAPD
3    monthly fee
2    Doubling time
1    Spa typing
[Figure panels: Day 5, Day 17, Day 365.]
Ontology-centric web client
S3DB is equipped with a REST application programming interface (API); that is, client applications can easily be woven together by composing URL calls with variable values.
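Composing URL calls with variable values might look like the sketch below. It is a generic illustration of the REST idea only: the base URL, the script name `query`, and the parameter names `key`, `entity` and `rule_id` are all hypothetical placeholders, not the actual S3DB API, and no network request is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def compose_call(base_url: str, script: str, **params: str) -> str:
    """Build a GET URL from a base address, a script name and
    keyword parameters (all names here are illustrative)."""
    # Sort the parameters so that the same call always yields the same URL.
    query = urlencode(sorted(params.items()))
    return f"{base_url.rstrip('/')}/{script}?{query}"


# Hypothetical call: list the statements attached to one rule.
url = compose_call(
    "http://example.org/s3db",
    "query",
    key="MY_KEY",
    entity="statement",
    rule_id="R5",
)

# Values are percent-encoded, so literals with spaces survive the round trip.
literal_url = compose_call("http://example.org/s3db", "query", value="56 years old")
decoded = parse_qs(urlsplit(literal_url).query)
```

Since every operation reduces to a URL, a client written in any language that can issue an HTTP request can take part, with no S3DB-specific library required.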
A year in the life of a semantic database
• Seeding: The first stage of usage of the semantic database is characterized by a focus on the domain of discourse. In this seeding stage many Rules are inserted without validation by submission of actual data (Statements).
[Figure panel: Day 152. Axis: Time (days).]
• Growth: This third pattern of usage lasts much longer than the previous two. It corresponds to relatively light editing of the domain of discourse and, in contrast, to intensified access to the database by the target community of users. This is distinct from the preceding Calibration stage, where data submission is frequently aided or even mediated by the database developers.
• Maturation: The end of the data acquisition program that motivated the creation of the database is sometimes associated with a decrease in the insertion of new data (Statements) and a near stop in the editing of the domain of discourse (Rules). This period of maturation therefore produces a stable data service that remains useful and is accessed regularly. We found this period to be ideal for harvesting: exporting the database schema for analysis of the knowledge domain, including the designing of intuitive Graphic User Interfaces.
Document-centric clients
… and client side applications can be easily developed, relying only on the
REST protocol to interoperate with the S3DB DBMS service.
S3DB is being used for a variety of molecular epidemiology domains, for
example, for Cancer Research:
[Figure: usage statistics. Panel label: Day 25. Axes: Sessions (0 to 1000), Rules (0 to 70), Users (0 to 25), Statements per rule (0 to 2500); a further axis runs from 0 to 350.]
• Calibration: once the submission of data triples (Statements) intensifies, the seed data model is reconsidered and is significantly edited. This second stage is characterized by heavy activity both regarding expanding or updating the domain of discourse and also regarding submission of data. We found this to be the right time to engage the user community with training programs.
a) Manual data input and retrieval.
b) Automatic data submission by BiS applications at high-throughput screening facilities.
c) Daemon applications using S3DB as a web service. These are typically BiS modules, open source bioinformatics applications or R scripts.
d) Public data and web services, for example, at NCBI, Cancer Genome Atlas, etc.
Bioinformatics Station (BiS) Server: available for download at http://bioinformaticstation.org
Semantic database (S3DB) server: available for download at http://S3DB.org
For the same functionality as web-applications, see the prototype at docs.s3db.org
[Figure: ontology-driven web-service oriented architecture for composite web-based applications. Labels: Code Distribution (BiS); SAAS Data Service; Client App.; Distributed Semantic DBMS (S3DB).]
Desired key features of a web-based information management system:
1. Syntactic interoperability: ability to get the data once told where it is.
2. Semantic interoperability: ability to use the data for a different purpose than the one that dictated its generation.
RESTful WOA
SPARQL endpoints (reified to native API exposed through REST)
Separation of domain of discourse from its instantiation
Permission migration built into the core data model