Ensembl Compara Perl API
Stephen Fitzgerald
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/
EBI - Wellcome Trust Genome Campus, UK
What is Ensembl Compara?
A single database which contains precalculated comparative genomics data
Access via the Perl API and MySQL
A production system for generating that database (not in this presentation)
Compara data
Raw genomic sequence
Whole genome alignments (tBLAT, BlastZ-net, PECAN)
46 species in Ensembl release-52
Syntenic regions (based on BlastZ-net)
Raw Protein Alignments
Protein Family clusters
Protein trees
Gene orthology / paralogy predictions
Compara database & the Ensembl core databases
Compara holds minimal primary data, so the external links to the core DBs must be re-established to gain full access to the data
Example: compara_52 must be linked with the Ensembl core_52 databases
Proper REGISTRY configuration is critical
Or load_registry_from_db is probably the best choice here
The Compara Perl API
Written in Object-Oriented Perl
Used to retrieve data from and store data into the ensembl-compara database
Generalized to extend to non-Ensembl genomic data (Uniprot)
Follows the same ‘Data Object’ & ‘Object Adaptor’ DBAdaptor design as the other Ensembl APIs
Compara object model overview
[Diagram: object classes arranged under the labels PRIMARY DATA, ANALYSIS and RESULTS]
Classes shown: NCBITaxon, GenomeDB, DnaFrag, Member, MethodLinkSpeciesSet, GenomicAlign, GenomicAlignBlock, SyntenyRegion, DnaFragRegion, Homology, Family, Attribute, ProteinTree, AlignedMember
Primary data
GenomeDB: relates to a particular Ensembl core DB
  name(), assembly(), genebuild(), taxon()
  fetch_by_name_assembly(), fetch_by_registry_name(), fetch_by_Slice(), fetch_all()
DnaFrag: represents a “top level” SeqRegion
  name(), length(), genome_db(), slice(), coord_system_name()
  fetch_by_Slice(), fetch_by_GenomeDB_and_name()
Member: lists all Ensembl genes + SwissProt + SPTrEMBL
  source_name(), stable_id(), genome_db(), taxon(), sequence(), get_all_peptide_Members(), get_longest_peptide_Member(), gene_member()
  fetch_by_source_stable_id()
Analysis
MethodLinkSpeciesSet provides a handle to isolate specific data from the shared tables (homology, genomic_align_block)
MethodLink: each individual analysis in Compara is tagged with a unique name called a method_link_type
  BLASTZ_NET, TRANSLATED_BLAT, PECAN, SYNTENY, FAMILY, ENSEMBL_ORTHOLOGUES, ENSEMBL_PARALOGUES, PROTEIN_TREES
SpeciesSet: the set of species as (a ref. to) an array of GenomeDBs
fetch_by_method_link_type_GenomeDBs(), fetch_by_method_link_type_registry_aliases()
name(), method_link_type(), species_set(), source()
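To make the idea of a "handle" concrete, here is a minimal sketch in plain Perl that does not use the Compara API at all. The table rows, ids and the find_mlss() helper are invented for illustration; the only point is that a (method_link_type, species set) pair resolves to one id, and that id is then used to pick rows out of a shared table.

```perl
use strict;
use warnings;

# Toy stand-in for the method_link_species_set table; ids and rows are invented.
my @mlss_table = (
    { id => 1, type => "BLASTZ_NET", species => ["human", "mouse"] },
    { id => 2, type => "BLASTZ_NET", species => ["human", "rat"] },
    { id => 3, type => "SYNTENY",    species => ["human", "mouse"] },
);

# Toy stand-in for a shared results table such as genomic_align_block.
my @blocks = (
    { mlss_id => 1, score => 5000 },
    { mlss_id => 2, score => 4200 },
    { mlss_id => 1, score => 3100 },
    { mlss_id => 3, score => 0 },
);

# Hypothetical lookup: the species set is matched regardless of order.
sub find_mlss {
    my ($type, $species) = @_;
    my $wanted = join ",", sort @$species;
    for my $row (@mlss_table) {
        next unless $row->{type} eq $type;
        return $row if join(",", sort @{ $row->{species} }) eq $wanted;
    }
    return;
}

# The handle isolates the human-mouse BLASTZ_NET rows from the shared table.
my $mlss     = find_mlss("BLASTZ_NET", ["mouse", "human"]);
my @selected = grep { $_->{mlss_id} == $mlss->{id} } @blocks;
printf "mlss id %d selects %d blocks\n", $mlss->{id}, scalar @selected;
```

In the real API the equivalent lookup is done by the MethodLinkSpeciesSet adaptor's fetch methods listed above, and the resulting object is passed to the other adaptors.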
Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
GenomeDB
1. Find out the versions of the human and mouse genomes in the database
2. Print the names of all the GenomeDBs in the database
DnaFrag
1. Get the DnaFrag for chromosome 1 of the macaque genome (using a genome_db object as an argument)
2. Get the DnaFrag for chromosome X of the mouse genome (using a core slice object as an argument)
MethodLinkSpeciesSet
1. Find out how many analyses are stored in the database
2. Get the name of the MethodLinkSpeciesSet corresponding to the BlastZ-net analysis for human and mouse
3. Get the names of all the species using the mlss corresponding to the Pecan analyses
GenomeDB example code
use strict;
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";

# Register every database found on the public Ensembl server
$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

# Compara adaptors are requested under the "Multi" species
my $genome_db_adaptor = $reg->get_adaptor("Multi", "compara", "GenomeDB");
my $genome_db = $genome_db_adaptor->fetch_by_registry_name("human");

print "Name :", $genome_db->name, "\n";
print "Assembly :", $genome_db->assembly, "\n";
print "GeneBuild :", $genome_db->genebuild, "\n";
GenomeDB example code
$> perl genome_db1.pl
Homo sapiens NCBI36 2006-08-Ensembl
Mus musculus NCBIM36 2006-04-Ensembl
DnaFrag example code
use strict;
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";

$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

my $genome_db_adaptor = $reg->get_adaptor("Multi", "compara", "GenomeDB");
my $genome_db = $genome_db_adaptor->fetch_by_registry_name("human");

# A DnaFrag is looked up by its GenomeDB plus the sequence region name
my $dnafrag_adaptor = $reg->get_adaptor("Multi", "compara", "DnaFrag");
my $dnafrag = $dnafrag_adaptor->fetch_by_GenomeDB_and_name($genome_db, "13");

print "Name :", $dnafrag->name, "\n";
print "Length :", $dnafrag->length, "\n";
print "CoordSystem :", $dnafrag->coord_system_name, "\n";
DnaFrag example code
$> perl test1.pl
Name :13
Length :114142980
CoordSystem :chromosome
MethodLinkSpeciesSet example code
use strict;
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";

$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

my $mlssa = $reg->get_adaptor("Multi", "compara", "MethodLinkSpeciesSet");
my $mlss = $mlssa->fetch_by_method_link_type_registry_aliases(
    "BLASTZ_NET", ["human", "mouse"]);

print $mlss->name, "\n";
print "type: ", $mlss->method_link_type, "\n";

# species_set() returns a reference to an array of GenomeDB objects
my $species_set = $mlss->species_set();
foreach my $this_genome_db (@$species_set) {
    print $this_genome_db->name(), "\n";
}
MethodLinkSpeciesSet example code
$> perl method_link_species_set.pl
H.sap-M.mus blastz-net (on H.sap)
Genomic Alignments
BlastZ-Net: used to compare closely related pairs of species
  BlastZ-raw -> BlastZ-chain -> BlastZ-net
Translated BLAT: used to compare more distant pairs of species
Pecan: multiple global alignments
  all vs all coding exons wublastp -> Mercator -> Pecan on each syntenic block
GenomicAlignBlock
GenomicAlignBlock represents a genomic alignment
contains 1 GenomicAlign per sequence
fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice)
Methods: method_link_species_set(), score(), length(), perc_id(), get_all_GenomicAligns(), get_SimpleAlign()
GenomicAlign
dnafrag(), genome_db(), get_Slice(), dnafrag_start(), dnafrag_end(), dnafrag_strand(), aligned_sequence()
GenomicAlignBlock
$all_GAlign = $GABlock->get_all_GenomicAligns()    -> $arrayref
$Simplealign = $GABlock->get_SimpleAlign()    -> $object
$Simplealign: a BioPerl object which contains the whole alignment; it can be printed in various formats using BioPerl modules
$Galign: an object which represents only one of the sequences in the alignment
Hsap.X.1223-1230: ACCTTC-A <- $ga
Cfam.X.1390-1395: ACC--CGA <- $ga
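The two gapped strings above can be compared column by column. The following is a small sketch in plain Perl, not part of the Compara API: compare_aligned() and ungapped_length() are hypothetical helpers, and the input strings are simply copied from the slide (in a real script they would come from aligned_sequence() on each GenomicAlign).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Compara API): walk two gapped strings
# column by column and count gapped, matching and mismatching columns.
sub compare_aligned {
    my ($seq_1, $seq_2) = @_;
    die "aligned sequences must have the same length\n"
        unless length($seq_1) == length($seq_2);

    my %stats = (columns => length($seq_1), gapped => 0, matches => 0, mismatches => 0);
    for my $i (0 .. length($seq_1) - 1) {
        my $base_1 = substr($seq_1, $i, 1);
        my $base_2 = substr($seq_2, $i, 1);
        if ($base_1 eq '-' or $base_2 eq '-') {
            $stats{gapped}++;
        }
        elsif (uc($base_1) eq uc($base_2)) {
            $stats{matches}++;
        }
        else {
            $stats{mismatches}++;
        }
    }
    return \%stats;
}

# Number of real bases in a gapped string (every character that is not "-").
sub ungapped_length {
    my ($seq) = @_;
    return ($seq =~ tr/-//c);
}

my $hsap = "ACCTTC-A";   # gapped strings copied from the slide
my $cfam = "ACC--CGA";

my $stats = compare_aligned($hsap, $cfam);
printf "columns=%d gapped=%d matches=%d mismatches=%d\n",
    @{$stats}{qw(columns gapped matches mismatches)};
printf "ungapped lengths: Hsap=%d Cfam=%d\n",
    ungapped_length($hsap), ungapped_length($cfam);
```

Counting gapped columns separately keeps the identity figure from being diluted by insertions and deletions; whether that matches how perc_id() is computed in the API is not something this sketch claims.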
Synteny
Based on BlastZ-net alignments
SyntenyRegionAdaptor
fetch_all_by_MethodLinkSpeciesSet_Slice(), fetch_all_by_MethodLinkSpeciesSet_DnaFrag()
Methods: get_all_DnaFragRegions(), method_link_species_set()
DnaFragRegion
slice(), dnafrag(), dnafrag_start(), dnafrag_end(), dnafrag_strand()
Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
GenomicAlignBlock
1. Fetch all the BLASTZ_NET alignments between the first 130K nucleotides of the human chromosome X and the mouse genome.
2. Print the exact location of the alignment blocks.
3. Compare the original and the aligned sequences.
4. Find the BLASTZ_NET alignments between human gene BRCA2 and the mouse genome.
5. Print the BLASTZ_NET alignments between the rat gene ECSIT and the mouse genome.
6. Print the PECAN multiple alignments between the rat gene ECSIT and 11 other amniote vertebrates.
7. Print the constrained-element alignments within the rat ECSIT locus (use the constrained elements generated from the 12-way alignments).
Synteny
1. Get the human-mouse syntenic map for human chromosome X.
GenomicAlignBlock example code
[...]
my $slice_adaptor = $reg->get_adaptor("human", "core", "Slice");
my $slice = $slice_adaptor->fetch_by_region("chromosome", "12", 1e4, 2e4);

my $gaba = $reg->get_adaptor("Multi", "compara", "GenomicAlignBlock");
my $genomic_align_blocks = $gaba->fetch_all_by_MethodLinkSpeciesSet_Slice(
    $method_link_species_set, $slice);

# Each block holds one GenomicAlign per aligned sequence
foreach my $this_gab (@$genomic_align_blocks) {
    my $all_gas = $this_gab->get_all_GenomicAligns();
    foreach my $this_ga (@$all_gas) {
        print $this_ga->genome_db->name(), ":", $this_ga->get_Slice()->name(), "\n";
        print $this_ga->aligned_sequence(), "\n";
    }
    print "\n";
}
GenomicAlignBlock example code
$> perl gab.pl
Mus musculus:chromosome:NCBIM37:6:121449987:121450302:-1
CCTCTTAATAAACATTATTGTCAA[…]
Homo sapiens:chromosome:NCBI36:12:19128:19507:1
CCTCTTAATAAGCACACATATCCT[..]
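The slice names in this output follow the pattern coord_system:version:seq_region:start:end:strand, so the location of each aligned region can be read straight from the string. Below is a minimal sketch in plain Perl; parse_slice_name() is a hypothetical helper, not an API method, and the two names are copied from the output above.

```perl
use strict;
use warnings;

# Hypothetical helper (not an API method): split a slice name of the form
# coord_system:version:seq_region:start:end:strand into named fields.
sub parse_slice_name {
    my ($name) = @_;
    my @fields = split /:/, $name;
    die "unexpected slice name: $name\n" unless @fields == 6;

    my %loc;
    @loc{qw(coord_system version seq_region start end strand)} = @fields;
    $loc{length} = $loc{end} - $loc{start} + 1;   # coordinates are inclusive
    return \%loc;
}

# Slice names copied from the example output.
my @names = (
    "chromosome:NCBIM37:6:121449987:121450302:-1",
    "chromosome:NCBI36:12:19128:19507:1",
);

for my $name (@names) {
    my $loc = parse_slice_name($name);
    printf "%s %s:%d-%d strand %s (%d bp)\n",
        @{$loc}{qw(version seq_region start end strand length)};
}
```

In a script that already holds the Slice object, the same values are normally read from the object itself rather than by parsing its name; string parsing is shown here only because it needs no database connection.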
Synteny example code
[...]
my $synteny_region_adaptor = $reg->get_adaptor("Multi", "compara", "SyntenyRegion");

my $synteny_regions = $synteny_region_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice(
    $human_mouse_synteny_method_link_species_set, $human_slice);

# Each SyntenyRegion holds one DnaFragRegion per species
foreach my $this_synteny_region (@$synteny_regions) {
    my $these_dnafrag_regions = $this_synteny_region->get_all_DnaFragRegions();
    foreach my $this_dnafrag_region (@$these_dnafrag_regions) {
        print $this_dnafrag_region->dnafrag->genome_db->name, ": ",
            $this_dnafrag_region->slice->name, "\n";
    }
    print "\n";
}
Homology
(e! 38):
  Orthologue predictions based on ‘best reciprocal blast hits’
  Paralogues for a selected set of species
  No global view of the evolutionary history of the gene considered
e! 39+:
  Orthologues and paralogues are inferred from protein trees
  Phylogeny: Orthology/Paralogy in one go
BSR: Blast Score Ratio. When 2 proteins P1 and P2 are compared, BSR = score(P1,P2) / max(self-score(P1), self-score(P2)). The default threshold used in the initial clustering step is 0.33.
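As a worked example of the BSR definition, here is a short sketch in plain Perl. The bit scores are made up, blast_score_ratio() is a hypothetical helper, and treating pairs at or above the threshold as "kept" is an assumption about how the threshold is applied, not something stated on the slide.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: BSR = score(P1,P2) / max(self-score(P1), self-score(P2)).
sub blast_score_ratio {
    my ($score_p1_p2, $self_score_p1, $self_score_p2) = @_;
    my $denominator = max($self_score_p1, $self_score_p2);
    die "self-scores must be positive\n" unless $denominator > 0;
    return $score_p1_p2 / $denominator;
}

my $threshold = 0.33;   # default threshold quoted on the slide

# Made-up scores: [score(P1,P2), self-score(P1), self-score(P2)]
my @pairs = ([150, 400, 500], [200, 400, 500]);

for my $pair (@pairs) {
    my $bsr = blast_score_ratio(@$pair);
    # Assumption: a pair is kept when its BSR reaches the threshold.
    printf "BSR = %.2f (%s)\n", $bsr, $bsr >= $threshold ? "kept" : "discarded";
}
```

Dividing by the larger self-score means a short protein that matches only part of a long one gets a low ratio, which is the usual motivation for normalising by self-scores.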
Homology types
Homology
Homology object contains 1 pair of Member/Attribute per gene/protein
fetch_all_by_Member(), fetch_all_by_MethodLinkSpeciesSet(), fetch_all_by_Member_MethodLinkSpeciesSet()
Methods: method_link_species_set(), description(), subtype(), perc_id(), get_all_Member_Attribute(), get_SimpleAlign()
Family
Compara computes gene family clusters
Runs on all Ensembl transcripts plus all Uniprot/SWISSPROT and Uniprot/SPTREMBL metazoan proteins
The algorithm is based on:
  All vs all blastp
  MCL clustering
  Muscle multiple aligner
Results stored in the family and family_member tables
Family
Family object contains 1 pair of Member/Attribute per gene/protein
fetch_all_by_Member()
Methods: method_link_species_set(), description(), description_score(), get_all_Member_Attribute(), get_SimpleAlign()
Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
Members
1. Find the Member corresponding to SwissProt protein O93279
2. Find the Member for the human gene BRCA2
3. Find all the peptide Members corresponding to the human gene CTDP1
Homology
1. Get all the predicted homologues for the human gene BRCA2
2. Get all the mouse orthologues predicted for the human gene CTDP1
Family
1. Get the family predicted for the human gene BRCA2
2. Get the alignments corresponding to the family of the human gene HBEGF
Member example code
use strict;
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";

$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

my $member_adaptor = $reg->get_adaptor("Multi", "compara", "Member");
my $member = $member_adaptor->fetch_by_source_stable_id(
    "ENSEMBLGENE", "ENSG00000000971");

# A gene Member gives access to the peptide Members of its translations
print "All proteins:\n";
my $all_peptide_members = $member->get_all_peptide_Members();
foreach my $this_peptide (@$all_peptide_members) {
    print $this_peptide->stable_id(), "\n";
}
Member example code
$> perl test2.pl
All proteins:
ENSP00000356399
ENSP00000356398
ENSP00000352658
Homology example code
[...]
my $ma = $reg->get_adaptor("Multi", "compara", "Member");
my $member = $ma->fetch_by_source_stable_id("ENSEMBLGENE", "ENSG00000000971");

my $homology_adaptor = $reg->get_adaptor("Multi", "compara", "Homology");
my $homologies = $homology_adaptor->fetch_all_by_Member($member);

foreach my $this_homology (@$homologies) {
    print $this_homology->description, "\n";
    my $member_attributes = $this_homology->get_all_Member_Attribute();
    # Each element is a [Member, Attribute] pair
    foreach my $this_mem_attr (@$member_attributes) {
        my ($this_member, $this_attribute) = @$this_mem_attr;
        print $this_member->genome_db->name, " ",
            $this_member->source_name, " ",
            $this_member->stable_id, "\n";
    }
    print "\n";
}
Family example code
[...]
my $ma = $reg->get_adaptor("Multi", "compara", "Member");
my $member = $ma->fetch_by_source_stable_id("ENSEMBLGENE", "ENSG00000000971");

my $family_adaptor = $reg->get_adaptor("Multi", "compara", "Family");
my $families = $family_adaptor->fetch_all_by_Member($member);

foreach my $this_family (@$families) {
    print $this_family->description, "\n";
    my $member_attributes = $this_family->get_all_Member_Attribute();
    # Each element is a [Member, Attribute] pair
    foreach my $this_mem_attr (@$member_attributes) {
        my ($this_member, $this_attribute) = @$this_mem_attr;
        print $this_member->taxon->binomial, " ",
            $this_member->source_name, " ",
            $this_member->stable_id, "\n";
    }
    print "\n";
}
Getting More Information
perldoc: viewer for inline API documentation
  shell> perldoc Bio::EnsEMBL::Compara::GenomeDB
  shell> perldoc Bio::EnsEMBL::Compara::DBSQL::MemberAdaptor
  online at: http://www.ensembl.org/
Tutorial document:
  cvs: ensembl-compara/docs/ComparaTutorial.pdf
ensembl-dev mailing list:
  [email protected]
Exercise solutions:
  http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/solutions.html
Ensembl-dev mailing list and HelpDesk
ensembl-dev mailing list is great for questions around the API and the DB
HelpDesk is very helpful
Give detailed info on what you are trying to do
Check that you have the modules installed ($PERL5LIB pointing to them)
Ensembl Team
Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino
VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
Zebrafish Annotation: Kerstin Jekosch, Mario Caccamo, Ian Sealy
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Browen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerck Vogel, Kevin Howe, Felix Kokocinski, Stephen Rice, Simon White
Comparative Genomics: Javier Herrero, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Albert Vilella, Leo Gordon
Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA)
Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
Distributed Annotation System (DAS): Eugene Kulesha
BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haldar
Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl, Andreas Kähäri
A special case of ortholog