![Page 1: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/1.jpg)
APAN e-Science Workshop
e-Bio System for Bio-Knowledge Discovery
2003.8.27
Sangsoo Kim
National Genome Information Center
Korea Research Institute of Bioscience and Biotechnology
![Page 2: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/2.jpg)
Bio-Databases & Servers
• Contents
  – Bibliographic (journal abstracts such as Medline)
  – Experimental data (sequences or structures)
  – Results from annotation and analyses
  – Bioinformatic analysis tools
• Purpose
  – Storing & managing raw data
  – Querying for knowledge discovery
  – Sharing information with others
  – Serving others with online analysis
![Page 3: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/3.jpg)
New Role of Databases
• New discoveries of biological knowledge are published in scientific journals
• But journal space is limited and unsuited to publishing large amounts of high-throughput data
• Supplementary information is therefore provided on an accompanying website
• Readers can download the supplementary information and analyze it from different angles
• Combining it with other information may yield unexpected results
• Journal publishers require supplementary information to be deposited in public archives
![Page 4: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/4.jpg)
Example - Nucleotide Sequence Repositories
• Nucleotide sequences determined by sequencing experiments are deposited in one of the public archives, and the journal paper lists only the accession numbers (without deposition, you cannot publish a sequence discovery in a journal)
• Public archives:
  – DDBJ, operated by CIB, NIG in Japan
  – EMBL, operated by EMBL-EBI in the UK
  – GenBank, operated by NCBI, NIH in the USA
• The contents of these archives are exchanged daily and are freely accessible to everybody
• Now extended to archive DNA chip data as well
![Page 5: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/5.jpg)
Growth of GenBank
A Nucleotide Sequence Repository
Human Genome Project
![Page 6: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/6.jpg)
RTFM
Entrez: Home Page
![Page 7: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/7.jpg)
GenBank as HTML
Entrez: Display
FASTA as HTML
![Page 8: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/8.jpg)
Example – BLAST Servers
• Originally developed to compare a query sequence against those in the repository, in order to check whether it is novel
• Extended to detect distantly related sequences, serving as the major sequence annotation tool
• Servers accept various kinds of queries and return alignment results over the WWW
• The most widely used bioinformatic tool
• For the analysis of many sequences, it is better to use a local installation
![Page 9: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/9.jpg)
BLAST (Basic Local Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/BLAST

program   query      database
blastn    dna        dna
blastp    protein    protein
blastx    dna (6x)   protein
tblastn   protein    dna (6x)
tblastx   dna (6x)   dna (6x)

(6x): translated in all six reading frames

RTFM
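The "dna (6x)" entries in the program table refer to translating a nucleotide sequence in all six reading frames before a protein-level comparison. A minimal sketch of that translation step follows, assuming the standard genetic code; the function names are illustrative and are not part of BLAST itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Translate three forward frames (+1..+3) and three reverse frames (-1..-3)."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

For example, `six_frame_translate("ATGGCC")` gives "MA" for frame +1 and "GH" for frame -1; blastx would compare each of the six translations against a protein database.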
![Page 10: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/10.jpg)
Descriptions Alignments
BLASTN (Cont'd)
![Page 11: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/11.jpg)
Example – Derived Databases
• Swiss-Prot & PIR
  – Proteins are predicted from deposited nucleotide sequences, either mRNA or genomic DNA
  – Functions and features of the proteins are annotated manually by experts
• Protein motifs
  – Prosite, Pfam, BLOCKS, InterPro
  – Keyword querying and motif detection in the user’s sequence
• Gene Ontology
  – Hierarchical organization of biological terms
  – Cataloging associated gene products
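Motif detection in a user's sequence can be illustrated with a PROSITE-style pattern. The sketch below converts such a pattern into a regular expression and scans a protein sequence; it covers only the common pattern elements (`x`, `[..]`, `{..}`, `x(n,m)`, `<`, `>`), and the function names and the example sequence are made up for illustration. It is not how the PROSITE server itself is implemented.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (e.g. 'N-{P}-[ST]-{P}') to a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for every match, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]
```

With the N-glycosylation site pattern `N-{P}-[ST]-{P}`, `find_motif("MKNGSAANPTQ", "N-{P}-[ST]-{P}")` reports one hit, `NGSA` at position 3; the second `N` is followed by `P` and is rejected.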
![Page 12: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/12.jpg)
Expert Protein Analysis System
ExPASy (http://www.expasy.ch)
![Page 13: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/13.jpg)
NiceProt View
![Page 14: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/14.jpg)
Gene Ontology
• Systematic classification of biological terminology
  – Molecular function
  – Biological process
  – Cellular component
• Controlled vocabulary
• Associated gene list
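The combination of a hierarchical vocabulary and an associated gene list can be sketched as a small directed graph: a gene annotated with a specific term is also implicitly associated with every ancestor of that term. The hierarchy below is a heavily simplified toy fragment, and the gene names are hypothetical.

```python
# Toy fragment of a term hierarchy: child term -> list of parent terms.
# Simplified for illustration; the real ontology has intermediate terms.
PARENTS = {
    "molecular_function": [],
    "catalytic activity": ["molecular_function"],
    "kinase activity": ["catalytic activity"],
    "binding": ["molecular_function"],
}

# Hypothetical gene-to-term annotations.
ANNOTATIONS = {
    "geneA": ["kinase activity"],
    "geneB": ["binding"],
    "geneC": ["catalytic activity"],
}


def ancestors(term):
    """Return all terms reachable by following parent links from `term`."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def genes_for(term):
    """Genes annotated with `term` directly or with any of its descendants."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )
```

Querying `genes_for("catalytic activity")` returns `geneA` and `geneC`, because `kinase activity` is a descendant of `catalytic activity` in this toy graph.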
![Page 15: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/15.jpg)
![Page 16: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/16.jpg)
Data Mining
• Objective:
  – Discovery of (biological) knowledge by querying information in the databases and comprehending it
• Problems:
  – Too many databases
  – Different protocols for access
  – Lack of standards
  – Poor quality or propagation of errors
• Solutions:
  – Data warehousing or federated databases
![Page 17: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/17.jpg)
Catalog of Bio-DBs arranged by Data Domain
![Page 18: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/18.jpg)
Database of Databases
• Data warehousing
  – Collect all databases by mirroring
  – Store in a unified format
  – Entrez (NCBI) or SRS (EBI)
  – Powerful but heavy maintenance load
• Federated databases
  – Maintained by participating members
  – Accessed by common protocols
  – Bio-DAS or Web Services via SOAP/XML
  – Next generation technology, but dependent on both the cooperation of members and Internet bandwidth
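The trade-off between the two approaches can be sketched with in-memory stand-ins: a warehouse copies records into its own store and answers queries locally, while a federation forwards each query to the member sources at query time. All class names, accessions, and descriptions below are hypothetical.

```python
class Source:
    """A member database exposing a common keyword-search interface."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # {accession: description}

    def search(self, keyword):
        return [acc for acc, text in self.records.items()
                if keyword.lower() in text.lower()]


class Warehouse:
    """Mirrors every source into one unified local store."""

    def __init__(self):
        self.store = {}

    def mirror(self, source):
        for acc, text in source.records.items():
            self.store[acc] = {"source": source.name, "description": text}

    def search(self, keyword):
        return sorted(acc for acc, rec in self.store.items()
                      if keyword.lower() in rec["description"].lower())


class Federation:
    """Keeps no copy; asks each member source at query time."""

    def __init__(self, sources):
        self.sources = sources

    def search(self, keyword):
        hits = []
        for source in self.sources:
            hits.extend(source.search(keyword))
        return sorted(hits)


def demo():
    a = Source("archive_a", {"X1": "human kinase"})
    b = Source("archive_b", {"Y1": "mouse kinase"})
    warehouse = Warehouse()
    warehouse.mirror(a)
    warehouse.mirror(b)
    federation = Federation([a, b])
    # A member adds a record after the last mirroring run.
    a.records["X2"] = "human kinase-like protein"
    return warehouse.search("kinase"), federation.search("kinase")
```

In `demo()`, the warehouse returns only `X1` and `Y1` until it is mirrored again, while the federation also sees the new `X2`. The federation, however, depends on every member being reachable at query time.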
![Page 19: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/19.jpg)
www.ngic.re.kr
![Page 20: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/20.jpg)
www.ncbi.nih.gov/LocusLink
![Page 21: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/21.jpg)
New Data Types
• Textual
  – Nucleotide or amino acid sequences
  – Associated feature annotation
  – Bibliographical texts
• Numeric
  – Gene expression profiles
  – Results from statistical analysis
• Graphical
  – Protein-protein interaction network
  – Genetic network
  – Biochemical reaction pathways
![Page 22: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/22.jpg)
![Page 23: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/23.jpg)
![Page 24: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/24.jpg)
![Page 25: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/25.jpg)
Building a Nation from a Land of City States
Lincoln D. Stein
Cold Spring Harbor Laboratory
![Page 26: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/26.jpg)
Italy in the Middle Ages
![Page 28: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/28.jpg)
Making Easy Things Hard
Give me all human sequences submitted to GenBank/EMBL last week.
![Page 29: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/29.jpg)
Lots of ways to do it
• Download weekly update of GenBank/EMBL from FTP site
• Use official network-based interfaces to data:
  – NCBI toolkit
  – EBI CORBA & XEMBL servers
• Use friendly web interfaces at NCBI, EBI
![Page 30: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/30.jpg)
Perl/Java/Python to the Rescue
• One script to do the web fetch
• Another to parse the file format
• A third to move into private database
• A fourth to repeat this weekly
• Result:
  – 6,719 scripts that do the same thing
  – None of them work together
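The fetch, parse, and load steps listed above can be sketched in a few lines, which is exactly why so many people write their own. In this sketch the web fetch is replaced by a stand-in that returns a made-up FASTA text, and the table layout is an assumption; a weekly scheduler (such as cron) would simply call `run_update` again.

```python
import sqlite3

# Made-up records standing in for a downloaded weekly update.
SAMPLE_FASTA = """>seq1 hypothetical record one
ACGTACGT
ACGT
>seq2 hypothetical record two
GGGCCC
"""


def fetch(source=SAMPLE_FASTA):
    """Stand-in for the web fetch; a real script would download the update."""
    return source


def parse_fasta(text):
    """Parse FASTA text into (id, description, sequence) tuples."""
    records, header, chunks = [], None, []

    def flush():
        if header is not None:
            parts = header.split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            records.append((seq_id, description, "".join(chunks)))

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    flush()
    return records


def load(records, conn):
    """Store records in a private database, replacing older versions by id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO sequences VALUES (?, ?, ?)", records)
    conn.commit()


def run_update(conn):
    """One pass of the pipeline: fetch, parse, load."""
    load(parse_fasta(fetch()), conn)
```

Each function here embeds private assumptions (file format details, table schema), so two independently written versions rarely interoperate.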
![Page 31: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/31.jpg)
What’s Wrong with This?
• My EMBL fetcher is poorly documented so you write your own
• Your fetcher won’t work with my parser
• My parser won’t work with your fetcher
• We’ve now wasted 20 hours rather than 10
• Multiply this by 6,719
![Page 32: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/32.jpg)
What Else is Wrong?
• NCBI/EBI tweaks something
• 6,719 scripts fail at once
• 6,719 bioinformaticists tear their hair
• 21,261 biologists curse the bioinformaticists
• 6,719 bioinformaticists curse their own existence
![Page 33: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/33.jpg)
Unifying Bioinformatics Services
MIMBD: Meetings on the Interconnection of Molecular Biology Databases
Federated models: Gaea, Kleisli
Data warehouses: GUS, MODs, Ensembl, UCSC
Ad hoc web services
Formal web services
![Page 34: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/34.jpg)
Ad hoc services
BioXXX
Your Script
Conf file
![Page 35: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/35.jpg)
Formal Web Services
SeqFetchService
BLATService
MicroarrayService
BLASTService
SeqFetchService
GOService
![Page 36: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/36.jpg)
Formal Web Services
ServiceRegistry
SeqFetchService
BLATService
MicroarrayService
BLASTService
SeqFetchService
GOService
![Page 37: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/37.jpg)
Formal Web Services
Your Script
ServiceRegistry
BioXXX MicroarrayService
SeqFetchService
BLATService
MicroarrayService
BLASTService
SeqFetchService
GOService
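The diagram's idea, a script that asks a registry for a service type instead of hard-coding one provider, can be sketched as follows. The registry class, the provider names, and the handlers are all hypothetical; the service type name is taken from the slide.

```python
class ServiceRegistry:
    """Maps a service type to the providers that implement it."""

    def __init__(self):
        self._services = {}

    def register(self, service_type, provider, handler):
        self._services.setdefault(service_type, []).append((provider, handler))

    def lookup(self, service_type):
        return list(self._services.get(service_type, []))


def fetch_from_site_a(accession):
    # Simulated outage at the first provider.
    raise ConnectionError("site_a is unreachable")


def fetch_from_site_b(accession):
    # Returns a made-up FASTA record for illustration.
    return f">{accession}\nACGT"


def call_service(registry, service_type, *args):
    """Try each registered provider in turn until one answers."""
    errors = []
    for provider, handler in registry.lookup(service_type):
        try:
            return provider, handler(*args)
        except ConnectionError as exc:
            errors.append(f"{provider}: {exc}")
    raise LookupError(f"no working provider for {service_type}: {errors}")


def build_demo_registry():
    registry = ServiceRegistry()
    registry.register("SeqFetchService", "site_a", fetch_from_site_a)
    registry.register("SeqFetchService", "site_b", fetch_from_site_b)
    return registry
```

Because the script names only the service type, a provider can change or disappear without the script failing, as long as another registered provider offers the same interface.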
![Page 38: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/38.jpg)
Technical Infrastructure is Here*
• Common vocabulary: GO
• Transport format: XML
• Data definition language: XSD
• Wire protocol: SOAP
• Service definition language: WSDL
• Service registry: UDDI
*(almost)
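To make the "transport format" and "wire protocol" layers concrete, the sketch below builds and reads a SOAP 1.1 request envelope using only the Python standard library. The service namespace, the `fetchSequence` operation, and its `accession` parameter are hypothetical; a real service would define them in its WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:seqfetch"  # hypothetical service namespace


def build_request(accession):
    """Build a SOAP envelope asking a hypothetical service for one sequence."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}fetchSequence")
    ET.SubElement(call, f"{{{SERVICE_NS}}}accession").text = accession
    return ET.tostring(envelope, encoding="unicode")


def read_accession(xml_text):
    """Server-side view: extract the accession from a request envelope."""
    root = ET.fromstring(xml_text)
    path = (f"{{{SOAP_NS}}}Body/{{{SERVICE_NS}}}fetchSequence/"
            f"{{{SERVICE_NS}}}accession")
    return root.find(path).text
```

The client and server agree only on the XML message structure, so either side can be written in Perl, Java, or Python without the other noticing.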
![Page 39: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/39.jpg)
Distributed Annotation System
http://www.biodas.org
Reference Server
AC003027 AC005122 M10154
Annotation Server Annotation Server
AC003027 M10154
WI1029 AFM820 AFM1126 WI443
AC005122
Annotation Server
Thursday 10:30 AM
Canyon IV
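The reference-server and annotation-server split can be sketched as a client-side merge: the reference server defines the sequences and their coordinate system, and each annotation server contributes features positioned on those coordinates. The accessions come from the slide; the sequence lengths, server names, feature labels, and positions below are invented for illustration and do not reflect the real DAS protocol messages.

```python
# Reference server: accession -> sequence length (lengths are invented).
REFERENCE = {"AC003027": 1000, "AC005122": 800, "M10154": 500}

# Annotation servers: (accession, start, end, label); all values are invented.
ANNOTATION_SERVERS = {
    "server_1": [
        ("AC003027", 100, 250, "feature_a"),
        ("M10154", 10, 60, "feature_b"),
    ],
    "server_2": [
        ("AC003027", 300, 400, "feature_c"),
        ("AC005122", 700, 900, "feature_d"),  # runs past the sequence end
    ],
}


def merge_annotations(reference, servers):
    """Overlay features from all servers onto the reference coordinates.

    Features on unknown accessions or outside the sequence are dropped.
    """
    merged = {accession: [] for accession in reference}
    for server, features in servers.items():
        for accession, start, end, label in features:
            if accession in reference and 1 <= start <= end <= reference[accession]:
                merged[accession].append((start, end, label, server))
    for features in merged.values():
        features.sort()
    return merged
```

Because every server shares the reference coordinates, independently maintained annotations can be combined without any server copying another's data.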
![Page 40: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/40.jpg)
Europe, ca 2000
![Page 41: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/41.jpg)
Bioinformatics, ca 2010?
![Page 42: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/42.jpg)
NGIC
KNIH
Human
Proteome
Animal
Ag-Bio
Crop
Plant
Microbial
Universities
Research Institutes
Industry
Collection and Sharing of National Genome Information
![Page 43: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062517/5681340f550346895d9aff3c/html5/thumbnails/43.jpg)
NGIC
KNIH
Human
Proteome
Animal
Ag-Bio
Crop
Plant
Microbial
Data Grid
KISTI ETRI
Application Grid
National Genome Information Network