Xianfeng Jeff Chen Ph.D.
Research Investigator/Project Manager
Overview and Implementation Strategy of the NIAID-Funded Bio-defense Proteomics Database System
Agenda Today
• VBI responsibility in Admin Center
• PRCs datatype and organism
• Proteomics data submission and storage workflow
• VBI computing system architecture (CPU and storage)
• VBI database system prototype and functionality
• VBI existing database schema and status
• Example Y2H schema for design logic and case study
• Proposed data integration and knowledgebase construction
(1) Introduction
(2) Database Development
(3) Strategy on Knowledgebase Development
Introduction
Proteomics Data Management
Tasks of Proteomics Data Management
• Raw data and processed data
• Data storage and visualization tools (VBI)
• Analysis, annotation, and curation (GU)
• Data QA/QC and interoperability (VBI/GU)
• SOP, LIMS, and Adm DB (SSS)
PRCs Major Data Type
Organization                          Major Data Type
University of Michigan                Microarray and mass spectrometry
Caprion                               Mass spectrometry
Harvard Proteomics Institute          Genomics and protein expression array
Albert Einstein College of Medicine   Mass spectrometry
PNNL                                  Mass spectrometry
Scripps                               NMR structural and X-ray crystal diffraction data
Myriad Genetics                       Yeast two-hybrid system
PRCs Organisms
Einstein    Toxoplasma gondii, Cryptosporidium parvum
Caprion     Brucella abortus
Harvard     Bacillus anthracis (protein array), Vibrio cholerae
Myriad      Bacillus anthracis (Y2H), Yersinia pestis, Francisella tularensis, vaccinia
PNNL        Orthopox (vaccinia and monkeypox), Salmonella typhimurium, Salmonella typhi
Scripps     SARS CoV
Michigan    Bacillus anthracis (TXP, MS) + host (human)
Proteomics Data Flow
Flow: PRCs → VBI → Public
Data Sources (Data Types)
• 2D Gels
• Protein Array
• LC
• Immunoaffinity purification
• Y2H
• MS
• MS/MS
• NMR
• X-Ray Cryoelectron Microscopy
• X-Ray Diffraction
• etc.
Processing Steps
(1) QA & QC (Quality Assurance & Quality Control)
(2) Converting to Standard Format (Standard Format for Each Data Type)
(3) QA & QC (Quality Assurance & Quality Control)
(4) Data Modeling w/ Decomposition
(5) Relational Database
MIAME and MIAPE-like Standards/SOP for Data Submission
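The data flow on this slide (a QA/QC pass, conversion to a standard format, a second QA/QC pass, then storage in a relational database) could be sketched roughly as follows. This is only an illustration under stated assumptions: the record fields, the helper names `qc_raw`, `to_standard`, and `qc_standard`, the `spectrum_peak` table, and the use of in-memory SQLite instead of the production database are all hypothetical and are not the PRC submission format.

```python
import sqlite3

# Hypothetical raw submission: MS peaks delivered as loosely typed strings.
RAW_RECORDS = [
    {"sample": "S1", "mz": "445.12", "intensity": "1032.5"},
    {"sample": "S1", "mz": "", "intensity": "88.0"},        # fails the first QC pass
    {"sample": "S2", "mz": "512.30", "intensity": "-4.0"},  # fails the second QC pass
]

def qc_raw(rec):
    """First QA/QC pass: every required field must be present and non-empty."""
    return all(rec.get(key) for key in ("sample", "mz", "intensity"))

def to_standard(rec):
    """Convert to an assumed standard format: a typed (sample, mz, intensity) tuple."""
    return (rec["sample"], float(rec["mz"]), float(rec["intensity"]))

def qc_standard(row):
    """Second QA/QC pass on the standardized record: values must be positive."""
    _, mz, intensity = row
    return mz > 0 and intensity > 0

def load(rows, conn):
    """Store accepted rows in a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spectrum_peak "
        "(sample TEXT, mz REAL, intensity REAL)"
    )
    conn.executemany("INSERT INTO spectrum_peak VALUES (?, ?, ?)", rows)
    conn.commit()

def run_pipeline(records, conn):
    """QC, standardize, QC again, then load; returns the accepted rows."""
    standardized = [to_standard(r) for r in records if qc_raw(r)]
    accepted = [row for row in standardized if qc_standard(row)]
    load(accepted, conn)
    return accepted

conn = sqlite3.connect(":memory:")
accepted = run_pipeline(RAW_RECORDS, conn)
print(accepted)
```

Running the two QC passes on either side of the conversion keeps format errors (missing fields) separate from content errors (implausible values), which is one plausible reading of the two QA/QC boxes in the diagram.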
Database Development
VBI Computing System
[Figure: VBI computing system. Labels: Binary Software; Project (Proteomics, Genomics); Data Storage; PC Users (Jeff, Wei, Chaitanya, Chengdong, Ranjan, Oswald, Bruno); LINUX; SUN (Solaris); Gimli; Elenwe; 7 PRCs; Networked File Server; TUOR Relational Database Server; Proteomics (Chendong, Jeff, Wei, Ranjan, Chaitanya)]
System Development in Q3 of 2005
[Figure labels: Web Server; Application Server; Web Interface; Database; environments: Development, Test/Stage, Production]
Proteomics Database Project Websites
Production: http://proteinbank.vbi.vt.edu/bprc
Test: http://proteinbankdev.gepasi.org/bprc/
Development: http://txue.bioinformatics.vt.edu:8080/bprc and http://wsun.vbi.vt.edu:8080/bprc/
Production Website Instance
Dynamically generated webpage
Functionalities:
(1) Account management
(2) File and doc management
(3) News group and news update
(4) Textual data display
(5) 2D gel image data display
(6) Table and record query
(7) Data uploading and simple submission
(8) HTTP data downloading
(9) SFTP file transfer
Database Query
Search By Experiment
• Select experiment
• Retrieve list of bait protein and nucleotide, prey protein and nucleotide
• Links to details of bait and prey (example: Drosophila melanogaster)
Search By Organism
• Escherichia coli
• Saccharomyces cerevisiae
• Homo sapiens
• Drosophila melanogaster
• Helicobacter pylori
• Caenorhabditis elegans
Search By Data Type
• Proteomics
• Genomics
• Microarray
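As a rough illustration of the "Search By Experiment" and "Search By Organism" queries, the sketch below retrieves bait and prey pairs for one experiment. The table and column names (`experiment`, `interaction`, `bait_protein`, `prey_protein`), the sample rows, and the use of in-memory SQLite are assumptions made for the example; they are not the actual VBI schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT, organism TEXT);
CREATE TABLE interaction (
    experiment_id INTEGER REFERENCES experiment(id),
    bait_protein  TEXT,
    prey_protein  TEXT
);
INSERT INTO experiment VALUES (1, 'demo_y2h_screen', 'Drosophila melanogaster');
INSERT INTO interaction VALUES (1, 'baitA', 'preyX');
INSERT INTO interaction VALUES (1, 'baitA', 'preyY');
""")

def search_by_experiment(conn, experiment_name):
    """Return (bait, prey) pairs recorded for one experiment, in sorted order."""
    sql = """
        SELECT i.bait_protein, i.prey_protein
        FROM interaction AS i
        JOIN experiment AS e ON e.id = i.experiment_id
        WHERE e.name = ?
        ORDER BY i.bait_protein, i.prey_protein
    """
    return conn.execute(sql, (experiment_name,)).fetchall()

def search_by_organism(conn, organism):
    """Return the names of experiments registered for one organism."""
    sql = "SELECT name FROM experiment WHERE organism = ? ORDER BY name"
    return [row[0] for row in conn.execute(sql, (organism,))]

pairs = search_by_experiment(conn, "demo_y2h_screen")
print(pairs)
print(search_by_organism(conn, "Drosophila melanogaster"))
```

Bound parameters (`?`) are used rather than string formatting so that values typed into a web search form cannot alter the SQL statement.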
Query for Scripps Sample Data
Search By Project/Experiment
• Scripps MS testing project
• Available peptide hit list
• Retrieve peak information and m/z and intensity list
Query for 2D Gel Data
Search By Experiment/Sample
Proteomics Database Architecture
Three Phases of Database Design
Process-Oriented Production Design
Data types: 2D Gel, Y2H, MS, NMR, Protein Array, LC, X-Ray Cryoelectron Microscopy, Immunoaffinity Purification, X-Ray Diffraction
• Multiple Schemas of Disparate Data
• Consolidate to One Schema to Remove Redundancy (Normalized with Key-value Pair)
• Stored Procedure for Analysis Pipeline
Layers
• Physical Layer
• Logical Layer: views, materialized views
• Final Views
• Application Layer
Proteomics Database Architecture: Three Database Instances
Phase 1 (Version 1, 0.5-1 year)
• Disparate data with multiple schemas
• Individual dataset modeling
Phase 2 (Version 2, 1-1.5 years)
• Consolidation into a few schemas
• A normalized data model implemented as key-value pairs, highly decomposed
Phase 3 (Version 3, 2 years)
• Analysis pipeline procedures
• Logical layer with views for the user
• Physical layer
1. Partially Processed Data
2. Data Enhanced with Knowledge
3. Interface Less Changeable
4. Curated/Annotated Data
Database instances: Development, Test/stage, Production
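The Phase 2 and Phase 3 ideas (a highly decomposed key-value physical layer, with a logical layer of views for the user) can be illustrated with a small sketch. The table name `attribute_value`, the view name `dataset_view`, and the attribute keys are hypothetical. SQLite is used only to keep the example self-contained; it has no materialized views, so an ordinary view stands in for them here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Physical layer: one decomposed key-value table shared by all data types.
CREATE TABLE attribute_value (
    entity_id INTEGER NOT NULL,
    attr_key  TEXT    NOT NULL,
    attr_val  TEXT,
    PRIMARY KEY (entity_id, attr_key)
);
INSERT INTO attribute_value VALUES (1, 'data_type', 'MS');
INSERT INTO attribute_value VALUES (1, 'organism',  'Brucella abortus');
INSERT INTO attribute_value VALUES (2, 'data_type', 'Y2H');
INSERT INTO attribute_value VALUES (2, 'organism',  'Yersinia pestis');

-- Logical layer: a view pivots the pairs back into one row per entity.
CREATE VIEW dataset_view AS
SELECT entity_id,
       MAX(CASE WHEN attr_key = 'data_type' THEN attr_val END) AS data_type,
       MAX(CASE WHEN attr_key = 'organism'  THEN attr_val END) AS organism
FROM attribute_value
GROUP BY entity_id;
""")

rows = conn.execute(
    "SELECT entity_id, data_type, organism FROM dataset_view ORDER BY entity_id"
).fetchall()
print(rows)
```

Because applications query only the view, new attribute keys can be added to the physical table without changing the columns that existing interfaces see, which fits the "Interface Less Changeable" goal listed above.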
Status of VBI Database Development
Schema            Development (Maturity)   Test/stage   Production
Adm               + (10/10)                +            +
2D Gel            + (10/10)                +            +
MS                + (10/10)                +            +
Interaction       + (9/10)                 +            -
Pathway           + (7/10)                 +            -
Data Repository   + (8/10)                 +            +
Y2H               + (10/10)                +            +
Genomics          + (10/10) (GUS)          +            +
Microarray        + (10/10) (AE)           +            +
Default Tablespace: Admin_data, Genomics_TBLS, Pathway_TBLS, Microarray_TBLS, Proteomics_TBLS.
Generic Experiment Data Components: Example of Database Design Logic
• Who (People)
• Where (Organization)
• Project (Goal)
• Materials and Methods (Metadata)
• Results (Raw Data)
• Conclusion and Hypothesis (Processed and Analyzed Data)
Y2H Data Component Modeling
[Figure labels: People; Experiment; Project; Sample; Results; Conclusion; Hypothesis; DNA/Protein Detail]
Experiment Component Object Model
[Figure labels: Experiment; Experiment Design; Experiment Factor; Factor Value; Design Description; Ontology Entry]
Ontology entries handle two annotation cases:
1) There are diverse choices, and existing ontologies can better capture the information.
2) The values are essentially controlled vocabularies, limited in the number of choices but likely to grow in the future or vary by technology type.
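One way to picture the experiment component object model is as a small set of linked classes. The sketch below is a hedged illustration only: the class attributes, the `factor_levels` helper, and the example values are hypothetical and are not taken from the actual Y2H schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class OntologyEntry:
    """A term drawn from an ontology or a controlled vocabulary."""
    category: str
    value: str

@dataclass
class FactorValue:
    """One level of an experimental factor, annotated by an ontology entry."""
    term: OntologyEntry

@dataclass
class ExperimentFactor:
    name: str
    category: OntologyEntry
    values: List[FactorValue] = field(default_factory=list)

@dataclass
class ExperimentDesign:
    description: str            # free-text design description
    design_type: OntologyEntry  # annotated with a controlled term
    factors: List[ExperimentFactor] = field(default_factory=list)

@dataclass
class Experiment:
    name: str
    design: ExperimentDesign

    def factor_levels(self, factor_name: str) -> List[str]:
        """Return the annotated values of one factor, or [] if it is absent."""
        for factor in self.design.factors:
            if factor.name == factor_name:
                return [fv.term.value for fv in factor.values]
        return []

# Hypothetical example data.
bait_factor = ExperimentFactor(
    name="bait_organism",
    category=OntologyEntry("FactorCategory", "organism"),
    values=[
        FactorValue(OntologyEntry("Organism", "Bacillus anthracis")),
        FactorValue(OntologyEntry("Organism", "Yersinia pestis")),
    ],
)
experiment = Experiment(
    name="demo_y2h_screen",
    design=ExperimentDesign(
        description="Hypothetical Y2H screen",
        design_type=OntologyEntry("DesignType", "yeast two-hybrid"),
        factors=[bait_factor],
    ),
)
print(experiment.factor_levels("bait_organism"))
```

Routing every annotation through `OntologyEntry` means a growing vocabulary adds rows of terms rather than new columns or classes, which matches the two annotation cases described above.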
Y2H Partial Database Schema
Proteomics DB System Architecture
[Figure labels: Public File Server; Private File Server; Oracle Relational Database; JDBC, Perl DBI/DBD, ODBC; JSP, CGI, Java; Perl, Java; Virtual Database/Warehouse; Application Layer; Web Display and Data Visualization]
Batch Processing
(1) Data uploading;
(2) Data validation;
(3) Data analysis;
(4) Data processing
System Architecture of Putative VBI Proteomics Knowledgebase: Data, Tool, Project, and Team Interoperability
[Figure labels: Security (repeated on each layer); Temporary data; Service-Oriented Middleware with Process Control; data sources: Array Express, Mass Spectrometry, Two Component System, 2D Gel, Structure Data, Genomics Data]
Strategy on Data Integration and
Construction of Knowledge Warehouse
Biological Information Workflow
• Information Storage, Queries & DB Management
• Cleaning, Processing Algorithms
• Curation and Annotation of Data
• Knowledge Generation
• Biological Research
• Target Discovery
• Diagnostics, Therapeutics & Vaccines
Data Management, Knowledge Management
Bio-IT Scope: Data integration, Knowledge generation, Knowledge management, Knowledge presentation
VBI PDC Project Phases
Phase I (first 2 years)
• Raw data management
• Schema development
• Data visualization
• Data standardization
Phase II (3rd-4th years)
• Integration at interface level
• Integration of data at DB level
• Interoperability of datasets
• Normalization and warehousing
Phase III (5th year)
• Predefined query
• Materialized view
• Comparative analysis
• Statistical analysis
Mapping the Proteome
(1) Yeast two-hybrid system: measures association between two proteins. Allows very high throughput.
(2) Mass spectrometry: allows identification of proteins within large complexes (2-100 proteins). Lower throughput.
Infer Complex Interaction Topology
[Figure labels: Proteins; Y2H Analysis: binary interactions; MS Analysis: N-ary interactions; PO4; Complex Interaction Model]
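A simple way to combine the two data types is to take the members of a complex identified by MS and keep only the binary Y2H interactions that connect two of those members. The sketch below shows that idea; it is an assumed, simplified approach rather than the method used in the project, and the protein identifiers and function names are hypothetical.

```python
from itertools import combinations

def complex_topology(complex_members, binary_interactions):
    """Return the Y2H edges whose two endpoints both belong to the MS complex."""
    members = set(complex_members)
    observed = {frozenset(pair) for pair in binary_interactions}
    return sorted(
        tuple(sorted(pair))
        for pair in observed
        if len(pair) == 2 and pair <= members  # skip self-pairs and outside edges
    )

def unresolved_pairs(complex_members, edges):
    """Member pairs seen together by MS but with no supporting Y2H edge."""
    all_pairs = {tuple(sorted(p)) for p in combinations(set(complex_members), 2)}
    return sorted(all_pairs - set(edges))

# Hypothetical inputs: one MS complex and a list of Y2H binary interactions.
ms_complex = ["P1", "P2", "P3"]
y2h_pairs = [("P1", "P2"), ("P3", "P2"), ("P3", "P9")]

edges = complex_topology(ms_complex, y2h_pairs)
print(edges)
print(unresolved_pairs(ms_complex, edges))
```

The pairs returned by `unresolved_pairs` are co-members with no direct Y2H evidence, so they may be linked only indirectly through another member of the complex.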
Bacillus anthracis Knowledgebase: Data Organization
Data                                                   Source
(1) Completed genome (Ames, Ames Ancestor, A2012)      NCBI, TIGR
(2) Yeast two-hybrid interaction data                  Myriad Genetics
(3) Mass spectrometry                                  Scripps and Caprion
(4) Microarray expression profiling                    Univ. of Michigan
(5) Intraspecies and interspecies clustering           NCBI (COG) and TIGR
(6) Functional category assignment                     GU (PIR)
Strategy for Knowledgebase Construction
(1) Annotation improvement
    (1) Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighboring, etc.
    (2) Comparative genomics with two reference genomes: E. coli and yeast
(2) Identifying anchor points for data integration
    (1) Known metabolic pathways: E. coli and yeast;
    (2) Known signal transduction pathways;
    (3) Known gene regulation machinery;
    (4) Known protein-protein interaction map.
Data Integration
• Genomics data
• Improved annotation
• Comparative genomics
• Anchor on the knowledge network of the reference genomes: E. coli and yeast
• Lay down Y2H interaction data to expand the network
• Lay down MS multiple-interaction data to expand the network
• Lay down microarray data to add co-expression patterns to the gene network
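The layering strategy on this slide can be sketched as a network in which each edge accumulates evidence labels as successive data sets are laid down. Everything concrete below is an assumption for illustration: the gene identifiers, the function names, the choice to expand each MS complex into pairwise co-membership edges, and the mapping of evidence sets onto the figure key.

```python
from collections import defaultdict
from itertools import combinations

def edge(a, b):
    """Normalize an undirected edge so (a, b) and (b, a) are the same key."""
    return tuple(sorted((a, b)))

def build_network(curated, y2h, ms_complexes, coexpressed):
    """Lay down each data layer in turn, recording the evidence per edge."""
    network = defaultdict(set)
    for a, b in curated:                      # anchor: reference knowledge
        network[edge(a, b)].add("curated")
    for a, b in y2h:                          # binary Y2H interactions
        network[edge(a, b)].add("y2h")
    for members in ms_complexes:              # assumed: pairwise co-membership
        for a, b in combinations(sorted(set(members)), 2):
            network[edge(a, b)].add("ms")
    for a, b in coexpressed:                  # microarray co-expression
        network[edge(a, b)].add("coexpression")
    return dict(network)

def legend(evidence):
    """Map an evidence set onto the categories used in the figure key."""
    if {"curated", "y2h"} <= evidence:
        return "Both Curated + Y2H"
    if "curated" in evidence:
        return "Curated"
    if "y2h" in evidence:
        return "In-House Y2H"
    if "ms" in evidence:
        return "Multi-Protein Complex"
    return "Other"

# Hypothetical inputs.
network = build_network(
    curated=[("g1", "g2")],
    y2h=[("g2", "g1"), ("g2", "g3")],
    ms_complexes=[["g3", "g4"]],
    coexpressed=[("g1", "g2")],
)
for key in sorted(network):
    print(key, sorted(network[key]), legend(network[key]))
```

Keeping the evidence as a set per edge makes it easy to see which interactions are supported by more than one data type.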
Putative Knowledgebase: http://www.Bacillus_anthracis.org
Key: Multi-Protein Complex; Curated; In-House Y2H; Both Curated + Y2H
Data Mining and Knowledge Augmentation
Sources: Literature, Y2H analysis, MS analysis, Microarray
Acknowledgement
Name                    Role                               Organization
Dr. Jeff Chen           Project Manager/Investigator       VBI
Dr. Chendong Zhang      Senior Software Engineer           VBI
Dr. Steve Cammer        Bioinformatics Scientist           VBI
Dr. Oswald Crasta       Scientist and CI-Co-director       VBI
Susan Baker             DBA                                VBI
Jiang Lu                DBA                                VBI
Ranjan Jha              Software Engineer                  VBI
Qiang Yu                Software Engineer                  VBI
Jian Li                 Software Engineer                  VBI
Wei Sun                 Software Engineer                  VBI
Chaitanya Kommidi       Software Engineer                  VBI
Dr. Bruno Sobral        Co-PI                              VBI
Dr. Peter MacGarvey     Senior Bioinformatics Scientist    GU
Dr. Cathy Wu            Co-PI                              GU
Paula Yadvish           Web Coordinator                    SSS
Margaret Moore          PI                                 SSS