Data Integration, Gene Ontology, and the Mouse*
Joel Richardson, Ph.D.
Mouse Genome Informatics Group
The Jackson Laboratory
Bar Harbor, Maine 04609
* Not necessarily in that order.
We have the human sequence: OK, now what?
One species is not enough:
  model organisms (one strain is not enough)
  comparative studies
The sequence is just the beginning
  sequence variants
  gene regulation and interaction networks
  non-coding functional elements
  environmental effects
Genotype to phenotype
The Mouse
the premier animal model for studying human disease
> 95% same genes
same diseases, similar reasons (e.g., cancer, hypertension, diabetes, osteoporosis, …)
1000s lab strains, diff. characteristics
precise genetic control
The Jackson Laboratory
Private nonprofit research institution (est. 1929)
Studying mouse as a model of human biology and disease
National Cancer Research Center
Supplier of laboratory strains to researchers worldwide
Areas: metabolism, development, cancer, immune response
www.jax.org
Bar Harbor, ME 04609
Mouse Genome Informatics (MGI)
Consortium of NIH-funded projects
Housed at TJL
Integrates and disseminates public data resources covering selected aspects of mouse biology
First program project funding 1989
> $10M/y total, >60 people
Online since 1994.
www.informatics.jax.org
MGI Concept Map
[Diagram; node labels: Genes and other loci; Expression Data; Mapping Data; Molecular Fragments; DNA and Protein Sequences; Strains; Phenotypes; Anatomy; Genotypes; Alleles; References; Accession IDs; Variants]
Integration in MGI
Identifying objects.
Resolving or noting discrepancies.
Integration is key to knowledge discovery in age of genomics
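To make "identifying objects" and "resolving or noting discrepancies" concrete, here is a minimal sketch in Python. It is not MGI's actual procedure: the source names, accession IDs, and gene symbols are all invented for illustration, and the rule used (an ID is resolved only when every source agrees) is an assumption chosen for simplicity.

```python
from collections import defaultdict

# Hypothetical input: (source database, accession ID, gene symbol claimed by that source).
# Every value here is made up for the example.
records = [
    ("SourceA", "A:0001", "Abc1"),
    ("SourceB", "B:9001", "Abc1"),
    ("SourceB", "B:9002", "Xyz2"),
    ("SourceC", "A:0001", "Xyz2"),  # disagrees with SourceA about A:0001
]

def integrate(records):
    """Group claimed symbols by accession ID; separate agreed IDs from conflicting ones."""
    by_accession = defaultdict(set)
    for _source, accession, symbol in records:
        by_accession[accession].add(symbol)
    # An ID is "resolved" when all sources name the same object.
    resolved = {acc: next(iter(syms))
                for acc, syms in by_accession.items() if len(syms) == 1}
    # Otherwise the disagreement is noted rather than silently dropped.
    discrepancies = {acc: sorted(syms)
                     for acc, syms in by_accession.items() if len(syms) > 1}
    return resolved, discrepancies

resolved, discrepancies = integrate(records)
print(resolved)
print(discrepancies)
```

In practice a conflicting ID would more likely be routed to a curator than decided by rule, which is why the sketch only records the conflict.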
The Power Of Integration: Queries
What transcription factors are expressed in a 2-cell embryo and not in a blastocyst?
  integration of multiple expression assay data sets and data types.
  standardization of anatomical references and developmental stages
What development QTLs contain these TFs?
  integration of expression data and mapping data
  genetic map result of integrating lots of mapping data
What strains are distinguished by SNPs in this region?
And so on…
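The first two questions can be sketched as set operations over integrated data. The Python below is a toy illustration only: the gene names, QTL names, coordinates, and table layout are hypothetical, and it does not reflect the MGI schema or query interface. It also treats "not in a blastocyst" as "no positive result recorded at that stage", which is a simplifying assumption.

```python
# Hypothetical integrated data; every name and number is invented.
transcription_factors = {"TfA", "TfB", "TfC"}

# (gene, standardized developmental stage, detected?)
expression = [
    ("TfA", "2-cell embryo", True),
    ("TfA", "blastocyst", False),
    ("TfB", "2-cell embryo", True),
    ("TfB", "blastocyst", True),
    ("TfC", "blastocyst", True),
    ("GeneX", "2-cell embryo", True),  # expressed, but not a transcription factor
]

gene_location = {"TfA": ("1", 50)}                            # gene -> (chromosome, position)
qtls = {"QtlDev1": ("1", 40, 60), "QtlDev2": ("2", 10, 30)}   # QTL -> (chromosome, start, end)

def expressed_in(stage):
    """Genes with at least one positive expression result at the given stage."""
    return {gene for gene, s, detected in expression if s == stage and detected}

# Query 1: TFs expressed in the 2-cell embryo and not in the blastocyst.
tfs = (transcription_factors & expressed_in("2-cell embryo")) - expressed_in("blastocyst")

# Query 2: QTLs whose interval contains one of those TFs.
qtl_hits = sorted(
    name
    for name, (chrom, start, end) in qtls.items()
    for gene in tfs
    if gene_location[gene][0] == chrom and start <= gene_location[gene][1] <= end
)

print(tfs, qtl_hits)
```

The stage strings only match because both expression records use the same vocabulary, which is the point of standardizing anatomical references and developmental stages.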
The MGI System (from 40,000 feet)
[Diagram; labels: MGI RDBMS; Web Files; Data Downloads; Literature Curation; SQL; Load scripts; Editing Interface; Servlets; CGI Scripts; Files; Report Scripts]
MGI in Context
[Diagram; MGI db at the center, linked to: Scientific Literature; Mutagenesis Centers; GenBank; LocusLink; Unigene; TIGR; DoTS; OMIM; Ensembl; GO; Interpro; SwissProt; ATCC; RIKEN; Anatomy; RPCI; RatMap; NIA; MGC; I.M.A.G.E.; NCBI RefSeq]
Integration relies on Standard Vocabularies
Structured vocabularies
  The common semantic frameworks
  Structured into is-a/part-of hierarchies
Evidence-based annotation
  Associations of vocabulary terms with objects
  Evidence (codes), citations, etc., decorate the associations
Structured annotations and queries
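A structured query over such annotations can be sketched as follows. This is a toy model under stated assumptions, not the GO or MGI data model: the small hierarchy, gene names, and citation IDs are invented, and the query follows is-a and part-of links alike. IDA and ISS are used as example evidence codes.

```python
from dataclasses import dataclass

# Hypothetical mini-vocabulary: term -> list of (relationship, parent term).
parents = {
    "nucleus": [("is-a", "organelle")],
    "mitochondrion": [("is-a", "organelle")],
    "organelle": [("part-of", "cell")],
    "nuclear membrane": [("part-of", "nucleus")],
}

@dataclass(frozen=True)
class Annotation:
    obj: str        # the annotated object, e.g. a gene
    term: str       # vocabulary term
    evidence: str   # evidence code
    citation: str   # supporting reference (invented IDs here)

annotations = [
    Annotation("GeneA", "nuclear membrane", "IDA", "Ref:1"),
    Annotation("GeneB", "mitochondrion", "ISS", "Ref:2"),
    Annotation("GeneC", "cell", "IDA", "Ref:3"),
]

def ancestors(term):
    """All terms reachable from `term` by following is-a or part-of links upward."""
    seen, stack = set(), [term]
    while stack:
        for _relationship, parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def annotated_to(term, evidence=None):
    """Objects annotated to `term` or any of its descendants, optionally filtered by evidence code."""
    return sorted(
        a.obj for a in annotations
        if (a.term == term or term in ancestors(a.term))
        and (evidence is None or a.evidence == evidence)
    )

print(annotated_to("organelle"))
print(annotated_to("cell", evidence="IDA"))
```

Because the evidence code and citation travel with each association, a query can be restricted to, say, experimentally supported annotations without changing the vocabulary itself.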
Structured Vocabularies in MGI
Gene Ontology (GO)
  Functional gene annotations
Mammalian Phenotype (MP)
  Annotations to genotypes (e.g. knockouts)
Mouse Anatomical Dictionary
  Annotations of expression
Other standardized, non-structured vocabularies
  Mouse strains
  cell lines
  clone libraries
  tissues
  lots of smaller ones
Challenges
Domain very difficult to frame
  Huge variability, variety of data, formats, providers, update schedules & semantics, etc…
Biologists and Computer Scientists think differently.
  communication is paramount, but difficult
Rapid changes, e.g., in last 10 years:
  genetic crosses -> YAC/BAC mapping -> RH mapping -> genome sequence
  northern blots -> microarrays -> MPSS
System Evolution
The system is a software ecosystem
Maintenance is the cost of success
Changes and cost/benefit
  If it ain’t broke, don’t fix it
  Commitments/agenda/priorities
Credits
Richard Baldarelli, Matt Baya, Jon Beal, Dale Begley, Judy Blake, John Boddy, Dirck Bradt, Carol Bult, Nancy Butler, Donna Burkart, Jeff Campbell, Lori Corbani, Rebecca Corey, Sharon Cousins, Diane Dahmen, Harold Drabkin, Janan Eppig, Jackie Finger, David Garippa,
Lucette Glass, Carroll Goldsmith, Pat Grant, Terry Hayamizu, David Hill, Jim Kadin, Ben King, Debbie Krupke, Moyha Lennon-Pierce, Jill Lewis, Ira Lu, Cathy Lutz, Lois Maltais, Prita Mani, Mike McCrossin, Louise McKenzie, David Miers, Daniel Modrusan, Dieter Naf,
Li Ni, Janice Ormsby, Sridhar Ramachandran, Deborah Reed, Joel Richardson, Martin Ringwald, David Shaw, Bob Sinclair, Cynthia Smith, Connie Smith, Paul Szauter, Leslie Trombley, Pierre Vanden Borre, Michael Walker, Linda Washburn, Josh Winslow, Iry Witham, Sophia Zhu