BioMart 0.8 offers new tools, more interfaces, and increased flexibility through plugins
Junjun Zhang
BOSC 2011, Vienna, Austria
July 15, 2011
2
BioMart: an open source federated data management system
• Widely used by public/private biological databases
• Quickly makes in-house data accessible online
• User-friendly and flexible querying interfaces: web GUI and programmatic access APIs (REST, Perl, biomaRt, etc.)
• Automated data conversion tool
• Effortlessly federate in-house datasets with existing public BioMart datasets
www.biomart.org
3
BioMart 0.8 new features
• Integrated Java application makes it possible to build a BioMart data source, configure querying and presentation interfaces, and deploy a BioMart server from a single tool (MartConfigurator)
• Support more RDBMS (MS SQL Server, DB2, in addition to MySQL, PostgreSQL, and Oracle)
• Create ‘virtual mart’ from 3NF normalized source database without materialization
• New diverse Web GUIs and APIs provide added flexibility and ease of use
• Link indexing and parallel querying optimizations
• Supports several security features (HTTPS, OpenID, and OAuth protocols) for managing sensitive data
• Extendable plugin framework for analysis and visualization
4
Basic BioMart Concepts – the Power of Simplicity
Building or querying a BioMart data source only requires understanding of a few basic concepts:
• DataSource
• DataMart
• DataSet
• Attribute
• Filter
• AccessPoint (new)
• Analysis (new)
• Parameter (new)
BioMart hides the complexity of the underlying database schema and federation mechanism.
5
BioMart dataset is organized in a reverse star schema
6
A 3NF-normalized database can be converted to a reverse star schema
Source schema
Reverse star schema
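The conversion on this slide can be illustrated with a small sketch. It assumes the usual reading of a reverse star schema: a central main table holding one row per entity, with one-to-many values moved into dimension tables keyed back to it. The source tables, the sample rows, and the `gene__main` / `gene__pathway__dm` / `gene_id_key` names are hypothetical and chosen for illustration; this is not MartConfigurator's actual transformation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical 3NF source tables (illustrative only, not from the talk).
CREATE TABLE biotype (biotype_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT,
                   biotype_id INTEGER REFERENCES biotype(biotype_id));
CREATE TABLE gene_pathway (gene_id INTEGER REFERENCES gene(gene_id),
                           pathway TEXT);

INSERT INTO biotype VALUES (1, 'protein_coding'), (2, 'pseudogene');
INSERT INTO gene VALUES (10, 'KRAS', 1), (11, 'TP53', 1), (12, 'PTENP1', 2);
INSERT INTO gene_pathway VALUES (10, 'MAPK signaling'),
                                (11, 'p53 signaling'),
                                (11, 'Apoptosis');

-- Main table: one row per gene, with the many-to-one lookup folded in.
CREATE TABLE gene__main AS
    SELECT g.gene_id AS gene_id_key, g.symbol, b.label AS biotype
    FROM gene g JOIN biotype b USING (biotype_id);

-- Dimension table: one-to-many values, keyed back to the main table.
CREATE TABLE gene__pathway__dm AS
    SELECT gene_id AS gene_id_key, pathway FROM gene_pathway;
""")


def genes_in_pathway(pathway):
    """Filter on the dimension table, return attributes from the main table."""
    rows = conn.execute(
        "SELECT m.symbol, m.biotype FROM gene__main m "
        "JOIN gene__pathway__dm d USING (gene_id_key) "
        "WHERE d.pathway = ? ORDER BY m.symbol",
        (pathway,),
    )
    return rows.fetchall()


print(genes_in_pathway("Apoptosis"))
```

In this layout a filter on a one-to-many property needs a single join from the main table to one dimension table, which is the kind of query shape the reverse star schema is meant to keep simple.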
7
BioMart system components
Query Engine / Plugin
Client-side Plugin
8
MartConfigurator – an integrated tool for setting up, configuring and managing a BioMart server
9
BioMart 0.8 provides several data querying GUIs
MartForm
10
BioMart 0.8 provides several data querying GUIs
MartWizard
11
BioMart 0.8 provides several data querying GUIs
MartExplorer
12
Programmatic access API query syntax at the click of a button
13
Special GUI - MartReport
Mutation frequencies from cancer projects with data distributed around the globe
Ensembl
KEGG
Reactome
COSMIC
Pancreatic Expression Database (PED)
Breast Cancer Campaign Tissue Bank (BCCTB)
14
Special GUI - MartAnalysis
Most affected pathways
15
Special GUI – MartAnalysis
The sequence retrieval tool is implemented as a server-side analysis plugin
Genomic sequence retrieval tool
16
New query type - Analysis
Query against ‘affected_pathways’ analysis:
<Query>
  <Analysis name="affected_pathways" dataset="gene_oicrPanc">
    <Parameter name="biotype" value="protein_coding"/>
    <Parameter name="file_type" value="png"/>
    <Parameter name="img_height" value="8000"/>
    <Parameter name="img_width" value="12000"/>
  </Analysis>
</Query>
Query against ‘gene_sequence’ sequence retrieval tool:
<Query>
  <Analysis name="gene_sequence">
    <Parameter name="seq_type" value="gene_flank"/>
    <Parameter name="upstream_flank" value="500"/>
  </Analysis>
</Query>
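For programmatic clients, an Analysis query of the shape shown on this slide can be assembled rather than typed by hand. The following is a minimal sketch using Python's standard library; the helper name `build_analysis_query` is hypothetical, and only the `Query`, `Analysis`, and `Parameter` elements and their `name`, `dataset`, and `value` attributes are taken from the slide. How the resulting XML is submitted to a BioMart server is not covered here.

```python
import xml.etree.ElementTree as ET


def build_analysis_query(name, params, dataset=None):
    """Return an Analysis query string shaped like the slide's examples.

    `name` is the analysis name, `params` maps parameter names to values,
    and `dataset` is optional because the gene_sequence example omits it.
    """
    query = ET.Element("Query")
    attrs = {"name": name}
    if dataset is not None:
        attrs["dataset"] = dataset
    analysis = ET.SubElement(query, "Analysis", attrs)
    for key, value in params.items():
        # Attribute values must be strings, so numbers are converted here.
        ET.SubElement(analysis, "Parameter", {"name": key, "value": str(value)})
    return ET.tostring(query, encoding="unicode")


# Rebuild the slide's gene_sequence query.
print(build_analysis_query(
    "gene_sequence",
    {"seq_type": "gene_flank", "upstream_flank": 500},
))
```

Building the document through an XML library rather than string concatenation avoids quoting slips in attribute values.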
17
Several large collaborative projects are using BioMart for data management
• BioMart Central Portal (http://central.biomart.org)
• International Cancer Genome Consortium (http://dcc.icgc.org)
• POPCURE (collaboration with Pfizer, controlled access)
18
BioMart Central Portal (central.biomart.org)
First-of-its-kind, community-driven effort to provide unified access to dozens of biological databases spanning genomics, proteomics, model organisms, cancer data, and more
19
BioMart Portal provides access to a collection of data sources
“Master/Slave” like
20
International Cancer Genome Consortium Data Portal
GOALS: To obtain a comprehensive description of genomic, transcriptomic, and epigenomic changes in 50 different tumor types and/or subtypes that are of clinical and societal importance across the globe. 500 tumor and matched control samples will be analyzed per tumor type. At present, 12 countries have joined ICGC. Data will be generated by institutions all over the world.
To make the data available rapidly and with minimal restrictions, to accelerate research of the causes and control of cancer.
AUSTRALIA: Ovarian cancer (Serous cystadenocarcinoma); Pancreatic cancer (Ductal adenocarcinoma); Prostate cancer
MEXICO: Multiple sub-types
FRANCE: Breast cancer (Subtype defined by an amplification of the HER2 gene); Liver cancer (Hepatocellular carcinoma) (Secondary to alcohol and adiposity); Prostate cancer (Adenocarcinoma)
EU / FRANCE: Renal cancer (Renal cell carcinoma) (Focus on but not limited to clear cell subtype)
CANADA: Pancreatic cancer (Ductal adenocarcinoma); Prostate cancer (Adenocarcinoma)
UNITED STATES: Bladder cancer; Blood cancer (Acute myeloid leukemia); Brain cancer (Glioblastoma multiforme/lower grade glioma); Breast cancer (Ductal & lobular); Cervical cancer (Squamous); Colon cancer (Adenocarcinoma); Endometrial cancer (Uterine corpus endometrial carcinoma); Gastric cancer (Adenocarcinoma); Head and neck cancer (Squamous cell carcinoma/Thyroid carcinoma); Renal cancer (Renal clear cell carcinoma/Renal papillary carcinoma); Liver cancer (Hepatocellular carcinoma); Lung cancer (Adenocarcinoma/squamous cell carcinoma); Ovarian cancer (Serous cystadenocarcinoma); Prostate cancer (Adenocarcinoma); Rectal cancer (Adenocarcinoma); Skin cancer (Cutaneous melanoma)
INDIA: Oral cancer (Gingivobuccal)
GERMANY: Malignant lymphoma (Germinal center B-cell derived lymphomas); Pediatric brain tumors (Medulloblastoma and Pediatric pilocytic astrocytoma); Prostate cancer (Early onset)
JAPAN: Liver cancer (Hepatocellular carcinoma) (Virus-associated)
CHINA: Gastric cancer (Intestinal- and diffuse-type)
UNITED KINGDOM: Bone cancer (Osteosarcoma/chondrosarcoma/rare subtypes); Breast cancer (Triple negative/lobular/other); Chronic Myeloid Disorders (Myelodysplastic syndromes, myeloproliferative neoplasms and other chronic myeloid malignancies); Esophageal cancer; Prostate cancer
EU / UNITED KINGDOM: Breast cancer (ER positive, HER2 negative)
SPAIN: Chronic lymphocytic leukemia (CLL with mutated and unmutated IgVH)
ITALY: Rare pancreatic tumors (Enteropancreatic endocrine tumors and rare pancreatic exocrine tumors)
21
ICGC Data Portal Architecture
“Peer-to-Peer” like
22
(dcc.icgc.org)
23
Future Directions
• Creation of BioMart Central Registry to improve coordination between BioMart servers. It will be a permanent resource where BioMart data providers can register their data models, data sources and services.
• Enhancing the data transformation module for building BioMart databases from non-RDBMS data sources (e.g. flat data files, XML data files, etc.) with high scalability and flexibility.
• Enhancing the plugin system to allow various forms of data analysis and visualization. Third parties are encouraged to develop plugins to extend the capabilities of the system.
24
The BioMart team
Joachim Baran, Anthony Cros, Jonathan Guberman, Jack Hsu, Yong Liang, Elena Rivkin, Brett Whitty, Marie Wong-Erasmus, Long Yao, Syed Haider, Junjun Zhang, Arek Kasprzyk
For support: [email protected]