Online queries to BioMart webservices through the biomaRt package

Steffen Durinck (1), Wolfgang Huber (2)

1. NCI/NIH, Gaithersburg, Maryland, USA

2. EBI, Hinxton-Cambridge, UK

BioMart

• Generic data management system, a collaboration between EBI and CSHL
• Several query interfaces and administration tools
• Conduct fast and powerful queries using:
  – website
  – webservice
  – graphical or text-oriented applications
  – software libraries written in Perl and Java
• http://www.ebi.ac.uk/biomart/

Ensembl

Joint project between EMBL-EBI and the Sanger Institute

Produces and maintains automatic annotation on selected eukaryotic genomes.

http://www.ensembl.org

Ensembl martview

Ensembl martview

VEGA

The Vertebrate Genome Annotation (VEGA) database is a central repository for high-quality, frequently updated, manual annotation of finished vertebrate genome sequence.

Current release:
• Human
• Mouse
• Zebrafish
• Dog

http://vega.sanger.ac.uk

WormBase

WormBase is the repository of mapping, sequencing and phenotypic information for C. elegans (and some other nematodes).

http://www.wormbase.org

WormMart

GrameneMart

Gramene is a curated, open-source, Web-accessible data resource for comparative genome analysis in the grasses.

http://www.gramene.org

Gramene: A Comparative Mapping Resource for Grains

Other databases with BioMart interfaces

• dbSNP (via Ensembl)
• HapMap
• Sequence Mart: Ensembl genome sequences

BioMart database schemata

Simple star-like schemata avoid complex joins and enable fast data retrieval
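
As a rough illustration of why a star-like layout keeps queries simple, the sketch below models one central gene table and one satellite table in base R. The table and column names (gene_main, affy_dim, gene_key) are hypothetical and chosen for this example only; they are not the actual BioMart schema. The identifiers are the ones shown in the getGene output elsewhere in these slides. A query filters the satellite table and then needs a single key join back to the main table.

```r
# Hypothetical central ("main") table: one row per gene.
gene_main <- data.frame(
  gene_key        = 1:3,
  ensembl_gene_id = c("ENSG00000164305", "ENSG00000003400", "ENSG00000138794"),
  symbol          = c("CASP3", "CASP10", "CASP6"),
  stringsAsFactors = FALSE
)

# Hypothetical satellite ("dimension") table: probe sets keyed by gene_key.
affy_dim <- data.frame(
  gene_key = 1:3,
  probe    = c("202763_at", "210708_x_at", "211464_x_at"),
  stringsAsFactors = FALSE
)

# Filter the satellite table, then join once on the shared key.
hits   <- affy_dim[affy_dim$probe %in% c("202763_at", "211464_x_at"), ]
result <- merge(hits, gene_main, by = "gene_key")
print(result)
```

Every satellite table hangs off the same key, so adding another filter means one more lookup of the same shape rather than a chain of joins.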

BioMart user interfaces

MartShell

MartShell is a command-line BioMart user interface based on a structured query language: Mart Query Language (MQL)

BioMart user interfaces

Martview: web-based user interface for BioMart that lets remote users query all databases hosted by the EBI's public BioMart server.

MartExplorer

Perl and Java libraries

biomaRt interface to R/Bioconductor

The biomaRt package

Developed by Steffen Durinck (started Feb 2005)

Two main sets of functions:

1. Tailored towards Ensembl, shortcuts for FAQs (frequently asked queries): getGene, getGO, getOMIM...

2. Generic queries, modeled after MQL (Mart query language), can be used with any BioMart dataset

Two communication protocols

1. Direct MySQL queries to BioMart database servers

2. HTTP queries to BioMart webservices: more stable (across database releases), self-describing, fewer firewall problems
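
The HTTP protocol can be pictured as posting a small XML document to the martservice path. The sketch below assembles such a document in base R, assuming the BioMart XML query layout (a Query element wrapping a Dataset with Filter and Attribute children). build_mart_query is a hypothetical helper written for this illustration, not a biomaRt function, and the Query attributes shown are assumptions that a given server may not require. Nothing is sent over the network.

```r
# Hypothetical helper: assemble a BioMart-style XML query as a string.
build_mart_query <- function(dataset, attributes, filter = NULL, values = NULL) {
  filter_xml <- character(0)
  if (!is.null(filter)) {
    # Multiple values for one filter are sent as a comma-separated list.
    filter_xml <- sprintf('    <Filter name="%s" value="%s"/>',
                          filter, paste(values, collapse = ","))
  }
  attribute_xml <- sprintf('    <Attribute name="%s"/>', attributes)
  paste(c(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE Query>',
    # Assumed Query attributes; check the target server's documentation.
    '<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">',
    sprintf('  <Dataset name="%s" interface="default">', dataset),
    filter_xml,
    attribute_xml,
    '  </Dataset>',
    '</Query>'
  ), collapse = "\n")
}

query <- build_mart_query(
  dataset    = "hsapiens_gene_ensembl",
  attributes = c("affy_hg_u95av2", "hgnc_symbol"),
  filter     = "affy_hg_u95av2",
  values     = c("1939_at", "1000_at")
)
cat(query, "\n")
```

Because the query names datasets, filters and attributes rather than tables and columns, it does not have to change when the underlying database layout does, which is one way to read the "more stable" point above.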

Getting started

> library(biomaRt)
> listMarts()

$biomart
[1] "dicty"    "ensembl"  "snp"      "vega"     "uniprot"  "msd"      "wormbase"

$version
[1] "DICTYBASE (NORTHWESTERN)" "ENSEMBL 38 (SANGER)"
[3] "SNP 38 (SANGER)"          "VEGA 38 (SANGER)"
[5] "UNIPROT 4-5 (EBI)"        "MSD 4 (EBI)"
[7] "WORMBASE CURRENT (CSHL)"

$host
[1] "www.dictybase.org" "www.biomart.org"   "www.biomart.org"
[4] "www.biomart.org"   "www.biomart.org"   "www.biomart.org"
[7] "www.biomart.org"

$path
[1] ""                     "/biomart/martservice" "/biomart/martservice"
[4] "/biomart/martservice" "/biomart/martservice" "/biomart/martservice"
[7] "/biomart/martservice"

Gene annotation

The function getGene retrieves gene annotation for many types of identifiers.

Supported identifiers are:
– Affymetrix GeneChip probe set ID
– RefSeq
– Entrez Gene
– EMBL
– HUGO
– Ensembl
– Agilent identifiers (coming soon)

getGene

> mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> myProbes <- c("210708_x_at", "202763_at", "211464_x_at")
> z <- getGene(id = myProbes, array = "affy_hg_u133_plus_2", mart = mart)
> z

           ID symbol
1   202763_at  CASP3
2 210708_x_at CASP10
7 211464_x_at  CASP6
  description
1 Caspase-3 precursor (EC 3.4.22.-) (CASP-3) (Apopain) ...
2 Caspase-10 precursor (EC 3.4.22.-) (CASP-10) (ICE-like apoptotic pro..
7 Caspase-6 precursor (EC 3.4.22.-) (CASP-6) (Apoptotic protease Mch-2)...
  chromosome  band strand chromosome_start chromosome_end ensembl_gene_id
1          4 q35.1     -1        185785845      185807623 ENSG00000164305
2          2 q33.1      1        201756100      201802372 ENSG00000003400
7          4   q25     -1        110829234      110844078 ENSG00000138794
  ensembl_transcript_id
1       ENST00000308394
2       ENST00000272879
7       ENST00000265164

Gene annotation

• Note:

Ensembl independently maps Affymetrix probe sequences to the genomes. If a probe has no clear match, it is not assigned to a gene.

Gene annotation

• getGene returns a data frame containing:
  – Gene symbol
  – Description
  – Chromosome name
  – Band
  – Start position
  – End position
  – BioMart ID

getGene

> getGene(id = 100, type = "entrezgene", mart = mart)

   ID symbol
1 100    ADA
  description
1 Adenosine deaminase (EC 3.5.4.4) (Adenosine aminohydrolase). [Source:Uniprot/SWISSPROT;Acc:P00813]
  chromosome   band strand chromosome_start chromosome_end ensembl_gene_id
1         20 q13.12     -1         42681577       42713797 ENSG00000196839
  ensembl_transcript_id
1       ENST00000372874

Other functions

• getGO: GO id, GO term, evidence code

• getOMIM (Online Mendelian Inheritance in Man, a catalogue of human genes and genetic disorders): OMIM id, Disease, BioMart id

• getINTERPRO (an integrated resource of protein families, domains and functional sites): Interpro id, description

• getSequence

• getSNP

• getHomolog

getSequence

> seq <- getSequence(species = "hsapiens", chromosome = 19, start = 18357968, end = 18360987, mart = mart)
> seq

chromosome
[1] "19"
start
[1] 18357968
end
[1] 18360987
sequence
"AGTCCCAGCTCAGAGCCGCAACCTGCACAGCCATGCCCGGGCAAGAACTCAGGACGGTGAATGGCTCTCAGATGCTCCTGGTGTTGCTGGTGCTCTCGTGGCTGCCGCATGGGGGCGCCCTGTCTCTGGCCGAGGCGAGCCGCGCAAGTTTCCCGGGACCCTCAGAGTTGCACTCCGAAGACTCCAGATTCCGAGAGTTGCGGAAACGCTACGAGGACCTGCTAACCAGGCTGCGGGCCAACCAGAGCTGGGAAGATTCGAACACCGACCTCGTCCCGGCCCCTGCAGTCCGGATACTCACGCCAGAAGGTAAGTGAAATCTTAGAGATCCCCTCCCACCCCCCAAGCAGCCCCCATATCTAATCAGGGATTCCTCATCTTGAAAAGCCCAGACCTACCTGCGTATCTCTCGGGCCGCCCTTCCCGAGGGGCTCCCCGAGGCCTCCCGCCTTCACCGGGCTCTGTTCCGGCTGTCCCCGACGGCGTCAAGGTCGTGGGACGTGACACGACCGCTGCGGCGTCAGCTCAGCCTTGCAAGACCCCAGGCGCCCGCGCTGCACCTGCGACTGTCGCCGCCGCCGTCGCAGTCGGACCAACTGCTGGCAGAATCTTCGTCCGCACGGCCCCAGCTGGAGTTGCACTTGCGGCCGCAAGCCGCCAGGGGGCGCCGCAGAGCGCGTGCGCGCAACGGGGACCACTGTCCGCTCGGGCCCGGGCGTTGCTGCCGTCTGCACACGGTCCGCGCGTCGCTGGAAGACCTGGGCTGGGCCGATTGGGTGCTGTCGCCACGGGAGGTGCAAGTGACCATGTGCATCGGCGCGTGCCCGAGCCAGTTCCGGGCGGCAAACATG....

SNP

• Single Nucleotide Polymorphisms (SNPs) are common DNA sequence variations among individuals.

e.g. AAGGCTAA and ATGGCTAA

• biomaRt uses the SNP mart of Ensembl which is obtained from dbSNP
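
To make the example concrete, the two sequences above can be compared base by base in plain R. This is only a sketch of the idea of a single-base difference; it is not how the SNP mart or dbSNP identifies variants, and the variable names are made up for the illustration.

```r
# The two example sequences from the slide.
ref <- "AAGGCTAA"
alt <- "ATGGCTAA"

# Split each sequence into single bases.
ref_bases <- strsplit(ref, "")[[1]]
alt_bases <- strsplit(alt, "")[[1]]
stopifnot(length(ref_bases) == length(alt_bases))

# Positions where the two sequences disagree, and the alleles found there.
snp_pos <- which(ref_bases != alt_bases)
alleles <- paste(ref_bases[snp_pos], alt_bases[snp_pos], sep = "/")

print(snp_pos)
print(alleles)
```

The sequences differ at a single position, the second base, where one carries A and the other T. The allele column in the getSNP output uses the same slash notation.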

getSNP

> getSNP(chromosome = 8, start = 148350, end = 148612, mart = mart)

          tsc  refsnp_id allele chrom_start chrom_strand
1  TSC1723456  rs3969741    C/A      148394            1
2  TSC1421398  rs4046274    C/A      148394            1
3  TSC1421399  rs4046275    A/G      148411            1
4                rs13291    C/T      148462            1
5  TSC1421400  rs4046276    C/T      148462            1
6              rs4483971    C/T      148462            1
7             rs17355217    C/T      148462            1
8             rs12019378    T/G      148471            1
9  TSC1421401  rs4046277    G/A      148499            1
10            rs11136408    G/A      148525            1
11 TSC1421402  rs4046278    G/A      148533            1
12            rs17419210    C/T      148533           -1
13            rs28735600    G/A      148533            1
14 TSC1737607  rs3965587    C/T      148535            1
15             rs4378731    G/A      148601            1

Homology mapping

The getHomolog function maps an identifier of one type in one species to an identifier of the same or a different type in another species.

getHomolog

> from.mart = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> to.mart = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
> getHomolog(id = 2, from.type = "entrezgene", to.type = "refseq",
+     from.mart = from.mart, to.mart = to.mart)

                  V1                 V2           V3
1 ENSMUSG00000030111 ENSMUST00000032203    NM_175628
2 ENSMUSG00000059908 ENSMUST00000032228    NM_008645
3 ENSMUSG00000030131 ENSMUST00000081777    NM_008646
4 ENSMUSG00000071204 ENSMUST00000078431 NM_001013775
5 ENSMUSG00000030113 ENSMUST00000032206
6 ENSMUSG00000030359 ENSMUST00000032510    NM_007376

Find (microarray) probes of interest

getFeature function

Filter on:
• gene location
• symbol
• OMIM
• GO

getFeature

> getFeature(symbol = "BRCA2", array = "affy_hg_u133_plus_2", mart = mart)

  hgnc_symbol affy_hg_u133_plus_2
1       BRCA2         208368_s_at

> getFeature(chromosome = 1, start = 2800000, end = 3200000, type = "entrezgene",
+     mart = mart)

   ensembl_transcript_id chromosome_name start_position end_position entrezgene
1        ENST00000378404               1        2927907      2929327     140625
2        ENST00000304706               1        2927907      2929327     140625
3        ENST00000321336               1        2970496      2974193     440556
4        ENST00000378398               1        2975621      3345045      63976
5        ENST00000378398               1        2975621      3345045     647868
6        ENST00000270722               1        2975621      3345045      63976
7        ENST00000270722               1        2975621      3345045     647868
8        ENST00000378391               1        2975621      3345045      63976
9        ENST00000378391               1        2975621      3345045     647868
10       ENST00000378389               1        2975621      3345045         NA
11       ENST00000378388               1        2975621      3345045         NA

getFeature

Select all RefSeq IDs associated with diabetes mellitus:

> getFeature(OMIM = "diabetes mellitus",
+     type = "refseq",
+     species = "hsapiens",
+     mart = mart)

Ensembl Cross-references

• Functions to map between all cross-references available in Ensembl

• Can be used, for example, to map between different Affymetrix arrays

Ensembl Cross-references

• getPossibleXrefs
  – Retrieves all possible cross-references

> xref <- getPossibleXrefs(mart = mart)
> xref[1:10, ]

     species    xref
[1,] "agambiae" "embl"
[2,] "agambiae" "pdb"
[3,] "agambiae" "prediction_sptrembl"
[4,] "agambiae" "protein_id"
[5,] "agambiae" "uniprot_accession"
[6,] "agambiae" "uniprot_id"

Ensembl Cross-references

> xref = getXref(id = "1939_at",
+     from.species = "hsapiens",
+     to.species = "mmusculus",
+     from.xref = "affy_hg_u95av2",
+     to.xref = "affy_mouse430_2",
+     mart = mart)

The generic interface: the getBM function

useDataset

> library(biomaRt)
> mart <- useMart("ensembl")
> listDatasets(mart)

                      dataset version
1    rnorvegicus_gene_ensembl RGSC3.4
2    scerevisiae_gene_ensembl    SGD1
3       celegans_gene_ensembl  CEL150
4  cintestinalis_gene_ensembl    JGI2
5   ptroglodytes_gene_ensembl CHIMP1A
6      frubripes_gene_ensembl   FUGU4
7       agambiae_gene_ensembl  AgamP3
8       hsapiens_gene_ensembl  NCBI36
9        ggallus_gene_ensembl WASHUC1
10   xtropicalis_gene_ensembl  JGI4.1
11        drerio_gene_ensembl  ZFISH5
....(more)...

> mart <- useDataset(dataset = "hsapiens_gene_ensembl", mart = mart)

getBM

> getBM(attributes = c("affy_hg_u95av2", "hgnc_symbol"),
+     filter = "affy_hg_u95av2",
+     values = c("1939_at", "1000_at"),
+     mart = mart)

  affy_hg_u95av2 hgnc_symbol
1        1000_at       MAPK3
3        1939_at        TP53

mart – an object describing the database connection and the dataset

attributes – the names of the data fields you want to retrieve

filter – the name of the data field by which you want to restrict the query

values – the values of the filter to match (here, the two probe set IDs)
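
The same four arguments extend to queries with more than one filter. The sketch below only assembles the argument list for a region query similar to the getFeature example in these slides and does not contact a server. region_query_args is a hypothetical helper, and the filter names "chromosome_name", "start" and "end" are assumptions about the dataset that should be checked with listFilters(mart). When several filters are used, getBM expects values to be a list with one element per filter, in the same order.

```r
# Hypothetical helper: build getBM() arguments for a chromosomal region query.
region_query_args <- function(chromosome, start, end) {
  # Assumed filter names; verify them with listFilters(mart).
  filters <- c("chromosome_name", "start", "end")
  # One list element per filter, in the same order as `filters`.
  values <- list(chromosome, start, end)
  stopifnot(length(filters) == length(values))
  list(
    attributes = c("ensembl_transcript_id", "chromosome_name",
                   "start_position", "end_position"),
    filters = filters,
    values  = values
  )
}

query_args <- region_query_args(chromosome = 1, start = 2800000, end = 3200000)
str(query_args)

# With a live connection the query could then be run as (not executed here):
# mart   <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# result <- do.call(biomaRt::getBM, c(query_args, list(mart = mart)))
```

Keeping the arguments in a plain list makes it easy to inspect or log a query before it is sent.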

Locally installed BioMarts

• The main use case currently is to query public BioMart servers over the internet with biomaRt

• You can also install a BioMart server locally, populated with a copy of a particular version of a public dataset or with your own data

• Versioning is supported through a naming convention

Installation

biomaRt depends on the R packages RCurl and XML, which require additional system libraries (libcurl, libxml2)

The RMySQL package is optional

Platforms on which biomaRt has been installed:
• Linux
• Mac OS X
• Windows

Discussion

Using biomaRt to query public webservices gets you started quickly, is easy, and gives you uniform access to a large body of metadata

You need to be online

Online metadata can change behind your back, although it is possible to connect to a particular, immutable version of a dataset

Watch this space: the implementation of Bioconductor metadata packages is changing and improving, while keeping the familiar packaging and versioning system

Acknowledgements

• EBI
  – Wolfgang Huber
  – Arek Kasprzyk
  – Ewan Birney
  – Alvis Brazma
• ESAT-SCD KULeuven
  – Yves Moreau
  – Bart De Moor
• NIH/NHGRI
  – Sean Davis
• Bioconductor users

BioMart idea

Data sources: Phenotypes, Expression, Genomes, ...

BioMart integration layer

Example queries:
• Give me all genes with phenotype X
• Give me all genes expressed in brain
• Give me all genes associated with coding SNPs
• Give me all genes with phenotype X, expressed in brain and associated with coding SNPs