EBI is an Outstation of the European Molecular Biology Laboratory.
Ensembl – An Overview
Twitter: #Ensembl
Dr. Giulietta M. Spudich
Ensembl Outreach
EMBL-EBI
This talk …
Genome Sequencing and Browsers
Ensembl Data
Genes
Variation
Comparative Genomics
Regulation
Access
Beginnings …
1995: First free-living organism sequenced: the bacterium Haemophilus influenzae (1.8 million bp)
2001: First draft of the human sequence (3 Gb)
2004: ‘Finished’ human sequence
2014: Polished human sequence with haplotypes (GRCh38)
THOMAS POROSTOCKY; SOURCE:
MEETINGZONE
1000 Genomes Project
ENCODE
Today’s genomics – human
COURTESY OF NIH
Today’s genomics – other species
Ensembl – Access to …
Sister project …
Bacteria, Protists, Plants, Fungi, (non-vertebrate) Metazoa
CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC
CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG
TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTGCACTGCTGCGCCTCTGCTG
CGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAGA
TTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCAAAAAAGAACTGCACCTCTGGA
GCGGACTTATTTACCAAGCATTGGAGGAATATCGTAGGTAAAAATGCCTATTGGATCCAAAG
AGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATA
AGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAG
ACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGAAGAATC
TGAACATAAAAACAACAATTACGAACCAAACCTATTTAAAACTCCACAAAGGAAACCATCTTA
TAATCAGCTGGCTTCAACTCCAATAATATTCAAAGAGCAAGGGCTGACTCTGCCGCTGTAC
CAATCTCCTGTAAAAGAATTAGATAAATTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGT
AGACTAAAAGTCTTCGCACAGTGAAAT
CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC
CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG
TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTACTAAAATGGATCAAGCAGAT
GATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAG
AATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAGTGAAAGT
CCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCA
Raw sequence
Ensembl – unlocking the code
06 March 2014
Regulation
Gene
Allele
Conserved sequence
Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/
• Splice variants, proteins, non-coding RNA
• Small and large scale sequence variation, phenotype associations
• Whole genome alignments, protein trees
• Potential promoters and enhancers, DNA methylation
• User upload, custom data
Challenge: number of gene/protein sequences increases
• UniProtKB/Swiss-Prot (e.g. Q8IU82) 542,258
• UniProtKB/TrEMBL 51,616,950
• NCBI RefSeq (e.g. NP_006570) 37,371,278
Is there a consensus?
• Reaching a consensus coding sequence set for human and mouse.
• Human: 29,045 CCDS IDs – 18,683 Ensembl gene IDs (e74)
• Mouse: 23,093 CCDS IDs – 19,988 Ensembl gene IDs (e72)
The GENCODE set
www.gencodegenes.org
• Ensembl has long been respected for its high-quality gene sets
• GENCODE genes = Ensembl Automatic Pipeline + Havana Manual Annotation (+ Yale pseudogenes)
• GENCODE is used by ENCODE, 1000 Genomes, and other projects.
Ensembl Variation
Aims:
• Collect, integrate and annotate all known variants
• Provide tools for comparison to other genomic data
• Provide a framework for access and to improve understanding
Practical applications of variation
Agriculture, livestock breeding
• Disease-, insect-, and drought-resistant crops
• Healthier, disease-resistant animals
• Marker-assisted breeding
• More nutritious produce
• Reducing the costs of agriculture
Anthropology, evolution, and human migration
Molecular and clinical medicine
• Diagnosis, detection and treatment:
– e.g. myotonic dystrophy, fragile X syndrome, inherited colon cancer, familial breast cancer
• Pharmacogenomics "custom drugs"
DNA forensics
• Identification of suspects
• catastrophe victims
• endangered species
Variation Sources
www.ensembl.org/info/genome/variation/sources_documentation
• dbSNP (1000 Genomes, ClinVar, etc.)
• ESP (Exome Sequencing Project)
• UniProt
• COSMIC
• HGMD_Public
• NHGRI-GWAS
• & more …
Variation in the Browser
Uses an Ensembl gene set to annotate:
• SNPs
• Indels
• Variants in regulatory regions
• Structural variants
Publication: McLaren et al. 2010 (Bioinformatics)
Ensembl Variant Effect Predictor
• Perl script
• Web interface
• REST API
XML
New interface!
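As a rough illustration of the REST route to the Variant Effect Predictor, the sketch below builds a request for a variant given in HGVS notation and pulls the most severe consequence out of each returned record. It is a minimal sketch, not the talk's own code: the server address (rest.ensembl.org), the `vep/:species/hgvs/:notation` path and the `most_severe_consequence` field are assumptions based on the public Ensembl REST service as it exists today, and the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the current public Ensembl REST server (not named on the slide).
REST_SERVER = "https://rest.ensembl.org"


def vep_hgvs_url(hgvs, species="human"):
    """Build the URL of the REST VEP endpoint that takes HGVS notation."""
    # Keep ':' readable, percent-encode characters such as '>'.
    return f"{REST_SERVER}/vep/{species}/hgvs/{urllib.parse.quote(hgvs, safe=':')}"


def most_severe(vep_records):
    """Return the 'most_severe_consequence' value of each VEP record."""
    return [rec.get("most_severe_consequence") for rec in vep_records]


def fetch_vep(hgvs, species="human"):
    """Query the server (needs network access) and return the decoded JSON."""
    req = urllib.request.Request(
        vep_hgvs_url(hgvs, species),
        headers={"Content-Type": "application/json"},  # ask for JSON output
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical usage; the variant is only an example input.
    print(most_severe(fetch_vep("ENST00000366667:c.803C>T")))
```

The URL builder and the parser are kept separate from the network call so they can be checked offline.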
Ensembl Comparative Genomics
[Figure: phylogenetic tree of the species in Ensembl. Leaf labels, in the order they appear:]
Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii, Nomascus leucogenys, Macaca mulatta, Callithrix jacchus, Tarsius syrichta, Microcebus murinus, Otolemur garnettii, Tupaia belangeri, Mus musculus, Rattus norvegicus, Dipodomys ordii, Cavia porcellus, Ictidomys tridecemlineatus, Oryctolagus cuniculus, Ochotona princeps, Vicugna pacos, Tursiops truncatus, Bos taurus, Sus scrofa, Equus caballus, Felis catus, Ailuropoda melanoleuca, Mustela putorius furo, Canis familiaris, Myotis lucifugus, Pteropus vampyrus, Erinaceus europaeus, Sorex araneus, Loxodonta africana, Procavia capensis, Echinops telfairi, Dasypus novemcinctus, Choloepus hoffmanni, Monodelphis domestica, Macropus eugenii, Sarcophilus harrisii, Ornithorhynchus anatinus, Gallus gallus, Meleagris gallopavo, Anas platyrhynchos, Taeniopygia guttata, Anolis carolinensis, Pelodiscus sinensis, Xenopus tropicalis, Latimeria chalumnae, Oreochromis niloticus, Tetraodon nigroviridis, Takifugu rubripes, Xiphophorus maculatus, Oryzias latipes, Gasterosteus aculeatus, Gadus morhua, Danio rerio, Petromyzon marinus, Ciona savignyi, Ciona intestinalis, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae
Image obtained using Dendroscope (D.H. Huson and C. Scornavacca, Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks, Systematic Biology, 2012)
Whole genome alignments
Homo sapiens ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---ACTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Pan troglodytes ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAAGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCACTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Gorilla gorilla gorilla ........................................................................................................................
Ancestral sequences ........................................................................................................................
Pongo abelii ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TGTGCTGCACCTGTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-TAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC--------------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Macaca mulatta ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-CAGGCG-GCAGAGGTGGAAC--TGCTGCTGGC--------------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC--------------TCTG-CCGCGCCTCGGGTCTTTTCTGCGGCG
Callithrix jacchus ACGT-GG--TCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGACC--TGCTG-TGTC--------------TCTG-CCGCGCCTCCGGTCTTTTCTGCGACG
Ancestral sequences ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA
Mus musculus ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGGGAGCCGT-G----------AGGCGTTGCCGTCAGT-CAGCT-----------------ACCGCTGC-------------------
Ancestral sequences ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGTGAGAAGT-G----------AGGCGGTGCCGTCCGT-CAGCT-----------------ACCGCAAC-------------------
Rattus norvegicus ACGGCGC--AGAGCGCGGGCTTTTCGCAGGAGCGTGAGAAGT-G----------AGGCGGCGCCGTCCGT-CAGCG-----------------GCCGCAAC-------------------
Ancestral sequences ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TCCTT-CAGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA
Oryctolagus cuniculus ACGT-GC--CCAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAA-AAGGCT-ATGGAGGCGGAGC--TCCTT-CAGCT------------------CCGCGTCTGGGGTCTTGCCTAGGGCA
Ancestral sequences ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA
Bos taurus ACAT-ATCCCGAGAGCAGGCTTTTGGCGCGAGAATCTGAAAC-CCGGTGGGCGGAGGTGCGGC--TGCTG-AAGTTTG----------------C--TGTCTCGGGCGG-T---------
Ancestral sequences ACGT-GCTCCGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCGAGCGGAGGCGGAGC--TGCTG-GGGCTCC----------------C--TGTCTCGGGTGG-TTCTGTGGCA
Canis lupus familiaris ........................................................................................................................
Ancestral sequences ........................................................................................................................
Equus caballus ACGT-GCTCAGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAAGAAGGCAAGCGGAGGCGGAGT--TGCTG-GGGCTCC----------------C--TGACTGGGGTGG-TTGTGTGGCA
Great apes
Old World monkeys
Primates
Glires
Rodents
Laurasiatheria
Boroeutherian
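One simple way to read an alignment like the one above is to compare two rows column by column. The sketch below computes percent identity over the columns where both rows carry a base. It is a minimal illustration only: the function name is hypothetical, and treating both '-' (gap) and '.' (no sequence shown) as columns to skip is an assumption about how the rows should be interpreted, not something stated on the slide.

```python
def percent_identity(row_a, row_b, skip_chars="-."):
    """Percent identity over columns where both aligned rows have a base.

    Columns in which either row has a character from `skip_chars`
    are left out of the comparison entirely.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a in skip_chars or b in skip_chars:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return float("nan")  # nothing to compare
    return 100.0 * matches / compared


if __name__ == "__main__":
    # First seven columns of the Homo sapiens and Mus musculus rows.
    print(percent_identity("ACGT-GG", "ACGG-GC"))
```

Skipping gap columns keeps insertions and deletions from counting as mismatches; a different convention (counting gaps against identity) would give lower values.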
Gene expression: The basic model
Transcription Factor Binding Sites Promoter Gene
mRNA
Transcription Factors Activation
Repression RNA polymerase complex
2 nm
Available data
Regulation (ENCODE + …)
Open source – access our data!
• Ensembl Views (Website, ftp)
• Ensembl Database (Perl API, REST API, MySQL)
• BioMart – Quick Data Retrieval (Web interface, Bioconductor, Galaxy, biomaRt)
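For programmatic BioMart retrieval outside the web interface, a query can be expressed as a small XML document and sent to the mart service. The sketch below only builds such a query and the matching URL. It is a hedged illustration: the service address, the dataset name `hsapiens_gene_ensembl` and the attribute and filter names are assumptions based on the public Ensembl BioMart, and the helper functions are invented for this example.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: address of the public Ensembl BioMart service.
MARTSERVICE = "https://www.ensembl.org/biomart/martservice"


def build_query(dataset, attributes, filters=None):
    """Return a BioMart query as an XML string (tab-separated output)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters restrict the rows; attributes choose the columns.
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


def query_url(xml_query):
    """URL that submits the XML query to the mart service."""
    return MARTSERVICE + "?" + urllib.parse.urlencode({"query": xml_query})


if __name__ == "__main__":
    xml_query = build_query(
        "hsapiens_gene_ensembl",                       # assumed dataset name
        ["ensembl_gene_id", "external_gene_name"],     # assumed attribute names
        {"chromosome_name": "21"},                     # assumed filter name
    )
    print(query_url(xml_query))
```

The same filter and attribute choices are what the web interface, Galaxy and the Bioconductor package assemble behind the scenes.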
Ensembl is used worldwide
Top users:
UK
US
Canada
China
France
Germany
Italy
Japan
Spain
Workshops Worldwide (2013)
What’s coming? (2014)
New Assemblies:
• GRCh38 (and all the updated annotation)
  www.ensembl.info/blog (category GRCh38)
• Baboon
• Vervet monkey
• Amazon molly
• Crab-eating macaque (pre.ensembl.org)
• Hedgehog (pre.ensembl.org)
New BLAST
New Regulatory Build
www.ensembl.info/blog/2013/12/26/the-new-ensembl-regulatory-annotation
Learn more
• Comments and questions? [email protected]
• YouTube channel www.youtube.com/user/EnsemblHelpdesk
• Mailing lists [email protected], [email protected]
• Courses online www.ensembl.info/ecourse
• Our tutorials page www.ensembl.org/info/website/tutorials
Follow us
• Facebook www.facebook.com/Ensembl.org
• Twitter https://twitter.com/Ensembl
• Come visit our blog! www.ensembl.info
Acknowledgements
Funding
European Commission Framework Programme 7